[Jprogramming] Beginner--more questions about textual data handling

PackRat Tue, 05 Feb 2008 21:30:01 -0800

[The original subject line was: Beginner--how to change space separator 
to LF?]

Ric Sherlock wrote:
> If you give a simple example of the initial CRLF-separated lists and the
> format of a vector post-processing it may be possible to simplify
> further.

Again, thanks to everyone for the several responses!

I previously responded:
> OK, here's a simple example:

Ah, if life were only so simple!  In subsequently working with the 
actual data I need to import, I discovered there were several 
variations in the format of the data, which creates some complications 
in creating a more general solution.  My general question is whether I 
need to have multiple scripts to handle each variation, or is there a J 
way to accommodate the variations so that the J vector/array will end 
up being the same, regardless of the input variation?

First of all, here's my starting script that will input a file of data 
with the format: number<CR><LF>number<CR><LF>etc....

===================================================

require 'stdlib files'

NB.  set operations:
sort =: /:~
sortdown =: \:~
dedupe =: ~.
setor =: ,
setand =: e. # [
setnot =: -.

list1 =: x: ". 'm' fread < 'C:\rfile1.txt'
list2 =: x: ". 'm' fread < 'C:\rfile2.txt'
NB.  list1 =: 's' fread < 'C:\rfile1.txt'
NB.  list2 =: 's' fread < 'C:\rfile2.txt'

list1 =: dedupe sort list1
list2 =: dedupe sort list2

list3 =: list1 setnot list2
list3 =: dedupe sort list3

NB.  courtesy of Henry Rich (J Programming Forum):
list3 =: ; (LF ,~ ":)&.> list3

(toHOST list3) fwrite < 'C:\rfile3.txt'

===================================================

My original "read" command was:
   list1 =: 'm' fread < 'C:\rfile1.txt'

Frankly, I don't remember why I added ". to that statement: 
   list1 =: ". 'm' fread < 'C:\rfile1.txt'
Maybe seeing an example or something??  As I recall, it had to do with 
converting characters to numeric values, but, as I look again at the 
Dictionary, I don't see either the monadic or dyadic definitions 
fitting the situation.  Dyadic would seem to be what I was looking for, 
but there's no lefthand value ahead of the ". verb.
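For reference, a minimal sketch of what ". does with character data. Monadic ". executes a string as a J sentence, so a string of digits evaluates to the number it spells; dyadic ". converts without executing and substitutes the left argument for anything that is not a valid number. The sample strings are made up for illustration.

```j
NB. Monadic ". executes each string, so digit strings become numbers.
nums =: ". '12 34'               NB. a 2-item numeric list
rows =: ". 2 3 $ '123456'        NB. each row of a character matrix is executed

NB. Dyadic ". converts instead of executing; the left argument (_1 here)
NB. replaces any token that is not a valid number.
safe =: _1 ". '12 abc 34'
```

The dyadic form may be the safer choice for file data, since it does not raise an error on an unexpected token.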

I *do* know that I had to add x: because the numbers being read in were 
long: they were 14-digit library barcodes.  What's interesting is that 
*ALL* the documentation says J will handle up to about 16 digits 
without flipping over to exponential notation, yet it failed already 
with 14 digits.  As I said, interesting.

That verb sequence worked to read in that particular data variation.  
However, other data (which *appears* similar, but apparently isn't) 
failed to read in correctly.  (As I recall, "Domain error" was 
generated.)  This data had the following format: 
   <">b<number><"><CR><LF><">b<number><"><CR><LF>etc....
In other words, the file data looked like this (file includes quotes!):
   "b15649131"
   "b15649192"
   "b1564926x"
Well, *this* presented several challenges!  Since the earlier file also 
contained characters, I thought J would handle this data if I went back 
to the "non-numeric" reading of data:
   list1 =: 'm' fread < 'C:\rfile1.txt'

However, apparently 'm' requires *numeric* data??  I switched the flag 
to 's' (as in the NB. lines) and the read worked OK (that is, the J 
data looked like the 3 examples above).  But now I was faced with, "How 
do I get rid of the extraneous quotation marks?"  Or, eventually, 
perhaps the letter "b" as well?  That's where I'm stuck now.
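One possible approach, sketched under assumptions: if the file is read with the 'b' option of fread (one box per line), the quotation marks can be removed from every box with -. ("less") applied under open (&.>). The literal list below stands in for the result of the file read, and the names noquote and nob are hypothetical.

```j
NB. Stand-in for:  lines =: 'b' fread < 'C:\rfile1.txt'
lines =: '"b15649131"' ; '"b15649192"' ; '"b1564926x"'

noquote =: -.&'"' &.>            NB. remove every " from each boxed string
nob     =: }. &.>                NB. drop the first character of each boxed string

clean =: noquote lines           NB. b15649131, b15649192, b1564926x (boxed)
bare  =: nob clean               NB. 15649131, 15649192, 1564926x (boxed)
```

Note that nob drops the first character unconditionally; it assumes every line really does start with the "b" prefix.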

I know that J works wonderfully with numeric data, but any programming 
language ought to work just as well with textual data.  (There are an 
awful lot of textual files and databases out there for manipulation and 
data mining.)  As a beginner, I found it extremely challenging to find 
J help for handling and manipulating "real" *variable-length* textual 
data (in vectors and arrays) rather than numeric data.  I can't seem to 
find verbs in J that are equivalent to the following Visual Basic-like 
commands:

StringLeft(stringID,number) : return the leftmost <number> characters 
of <stringID>

StringRight(stringID,number) : return the rightmost <number> characters 
of <stringID>

StringMid(stringID,startpos,number) : return <number> characters of 
<stringID>, starting at the <startpos> character of <stringID>; if 
<number> is omitted, return the remainder of <stringID>, starting at 
the <startpos> character of <stringID>

syntax #2:
StringMid(stringID,startpos,number) = <stringID2> : starting at the 
<startpos> character of <stringID>, replace <number> characters of 
<stringID> with the first <number> characters of <stringID2>; if 
<number> is omitted, replace the remaining characters of <stringID> 
with the characters of <stringID2>; by the way, <stringID2> can be a 
string identifier or a literal string; important note: the replacement 
of characters can never go beyond the length of <stringID>!

"Left", "Right", and "Mid" (both forms) are *extremely* important for 
textual manipulation.  Are there J equivalents for these?
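A minimal sketch of how these four operations might be written with the J primitives {. (take), }. (drop), and } (amend). The verb names left, right, mid, and midset are hypothetical, not standard J verbs, and positions are 1-based to match the descriptions above. The definitions use the x and y argument names of recent J releases.

```j
left  =: 4 : '(y <. #x) {. x'       NB. leftmost y characters of x
right =: 4 : '(- y <. #x) {. x'     NB. rightmost y characters of x

NB. x mid start,number   (number may be omitted: rest of the string)
mid =: 4 : 0
  rest =. (<: {. y) }. x            NB. drop the characters before start
  if. 1 < # y do. ((1 { y) <. # rest) {. rest else. rest end.
)

NB. x midset start ; replacement
NB. Overwrites characters of x from start, never going past the end of x.
midset =: 4 : 0
  'start new' =. y
  n =. (# new) <. 0 >. (# x) - <: start    NB. how many characters fit
  (n {. new) ((<: start) + i. n)} x
)
```

For example, 'abcdef' left 3 gives abc, 'abcdef' mid 3 2 gives cd, and 'abcdef' midset 5;'WXYZ' gives abcdWX. The <. guards are there because a bare {. pads with spaces when asked for more characters than the string holds.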

Another question I see coming up shortly is how do I get J to accept 
the fact that a terminating "x" or "X" (in the above numbers, for 
example, or in book ISBNs) is a valid "numeric" character, being the 
result of a base-11 check-digit algorithmic calculation?  Or do I have 
to consider these "numbers" as *strings* (of characters) instead?

And, if I need to think/program in terms of strings (I presume this 
means boxed data?), will the set operations above work on boxed data, 
too, or are other definitions needed for boxed textual data?  These set 
operations are extremely important for what I wish to use J for at the 
moment.  (My earlier experiments with this script seemed to indicate 
that you couldn't sort boxed data or perform set operations on the 
boxed data.  On the other hand, my J knowledge is so meager at the 
moment that there might be ways, but I just don't know about them yet.)
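A small sketch suggesting that the same set definitions can be applied to a list of boxed strings, since /:~ , ~. , -. and e. all accept boxed arguments. The two sample lists are invented for illustration and all items have the same length.

```j
sort   =: /:~
dedupe =: ~.
setand =: e. # [
setnot =: -.

boxed1 =: 'b15649192' ; 'b15649131' ; 'b1564926x' ; 'b15649131'
boxed2 =: 'b15649192' ; 'b00000000'

only1 =: (dedupe sort boxed1) setnot boxed2   NB. items of boxed1 not in boxed2
both  =: boxed1 setand boxed2                 NB. items of boxed1 also in boxed2
```

A likely pitfall is mixing a boxed list with an unboxed character matrix: the set verbs compare whole items, so both arguments need to be boxed the same way.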

And one more question (for now!) about data massaging in J: it turns 
out that another variation in the data is that the first data item in 
many of the files I wish to read is a column header such as the 
following:
   "RECORD #(BIBLIO)"
Our local library automation system exports database data with column 
headers so that it's easy to import the data into MS Excel.  However, I 
want to import it into J.  How can I program J to read a file, 
*omitting* the first data value?  (That is, without throwing an error 
message because the column header is different from the rest of the 
data in the file?)  Or do I have to write such preliminary "data 
cleanup" routines in another programming language first, because J 
can't handle it?  (I really would rather be able to do everything in J, 
if possible.)  I should note that the data written back out at the end 
needs neither a column header nor quotation marks nor a recordtype 
prefix character (the "b" in the above sample data), although it might 
be nice to know if those export additions are possible.
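A sketch of one way the header line might be skipped, assuming the file has been read as boxed lines with 'b' fread: }. drops the first item of a list. It also shows how the prefix and quotation marks could be put back on export. The literal list stands in for the file contents, and requote is a hypothetical name.

```j
NB. Stand-in for:  lines =: 'b' fread < 'C:\rfile1.txt'
lines =: '"RECORD #(BIBLIO)"' ; '"b15649131"' ; '"b15649192"'

data =: }. lines                    NB. drop the first line (the column header)
ids  =: (}. @ (-.&'"')) &.> data    NB. strip quotes, then the leading b
out  =: ; ,&LF &.> ids              NB. one id per line, LF-terminated

requote =: ('"b' , ,&'"') &.>       NB. put "b in front and " behind each id
```

The string in out could then be written with (toHOST out) fwrite, as in the script above. This sketch drops the first line unconditionally, so a file without a header would lose its first record.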

Again, as previously, any help, guidance, and insights would be very 
much appreciated!  Thanks in advance!

Harvey

----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
