RE: [Jprogramming] Beginner--more questions about textual data handling

PackRat Wed, 06 Feb 2008 17:15:06 -0800

Bill, Rob, and Ric: 

THANK YOU! THANK YOU! THANK YOU!  The information you provided was 
great!  It'll keep me busy for a while, learning how it all works (and 
why) and applying it to my current needs.

Bill Lam wrote:
> PackRat wrote: 
> > "Left", "Right", and "Mid" (both forms) are *extremely* important 
> > for textual manipulation.  Are there J equivalents for these?
> 
> This is untrue, J works extremely well for data processing.

My comments seem not to have come across as clearly as I had thought.  
I was *NOT* knocking J by any means; it's an extremely powerful 
language, and I am constantly amazed (from the other J forums I belong 
to, but especially this one) at what it can do.  I was noting that 
every programming language (including J) is normally expected to be 
able to handle both numeric and textual data.  *My* problem was that I 
was having difficulty finding descriptions and examples of textual 
handling because the documentation seems so overwhelmingly numerically 
oriented.  (And I don't object to that--it's great, but I'd also like 
to see more documentation on how to work with variable-length textual 
data.)  I presumed that J, of course, could do these kinds of things--
it's just that I was having a hard time trying to find the information, 
or to put together the "2" and "2" that are out there to come up with 
the "4" that I was looking for.  As I said, "On the other hand, my J 
knowledge is so meager at the moment that there might be ways, but I 
just don't know about them yet."  Thanks for your examples of how to do 
the Left/Right/Mid string thing!

> The reason why you can not find any StringRight, StringMid is that
> it is too trivial to define cover verbs for them.

But wouldn't it be helpful to have references/pointers to these kinds 
of things in the documentation for the sake of people coming from other 
programming languages?  Having worked with non-mathematically-oriented 
adult beginners in programming, I'm very sensitive to the needs of 
beginning learners.  The cardinal rule of teaching is to go from the 
known to the unknown; you don't drop a learner in the middle of the 
ocean and say, "Sink or swim!"  To say in the documentation for "Take", 
for example, "For positive values of x, this is the equivalent of Left 
or StringLeft commands in other programming languages" would be both 
searchable and helpful for those new to J.
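
For anyone else coming from BASIC, here is a rough sketch of that 
Left/Right/Mid correspondence.  The verb names left, right, and mid are 
made up for illustration (they are not part of J), and mid assumes a 
0-based start position rather than the 1-based one BASIC uses.

```j
NB. Hypothetical cover verbs; the names are illustrative only.
left  =: {.                               NB. n left s  : first n characters
right =: -@[ {. ]                         NB. n right s : last n characters
mid   =: 4 : '({: x) {. ({. x) }. y'      NB. (start,len) mid s, start from 0

s =: 'abcdefgh'
3 left s       NB. abc
3 right s      NB. fgh
2 4 mid s      NB. cdef
```

One thing to watch: if the requested length runs past the end of the 
string, Take pads the result with spaces (the "overtake" case) instead 
of stopping short the way Left$ does.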

> Please read documentation on verb { {. }. and conjunction } for the
> details. 

Very helpful--thanks!  I just wish the "typical" uses were more 
obviously demonstrated.  The textual examples (particularly with 
"Take") showed exceptional cases (for example, reversed direction 
overtake) rather than the norm.  Beginners need examples of the norm.

> J is more powerful than VB. How would you do this in VB?
>        5 3 2 1{'abcdef'  => 'fdcb'

I'd do it with a series of concatenated Mid$ functions (the whole 
statement would be rather verbose and lengthy)--J obviously is far more 
concise and flexible.  To create a "general case" in VB (where the "5 3 
2 1" could be variable in the number of values present) would be rather 
more complex, whereas it's really quite simple in J--once I know what 
I'm doing! ;-)
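
As a small sketch of that "general case": the index list is just 
another noun, so it can have any length.  The names s and idx are only 
for illustration.

```j
s   =: 'abcdef'
idx =: 5 3 2 1       NB. any list of 0-based positions
idx { s              NB. fdcb
_1 _2 { s            NB. fe   (negative indices count from the end)
```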

Rob Hodgkinson wrote:
> When data is displayed it is according to the Print Precision (see
> 9!:10 '' to view the Print Precision, or see menu Edit/Configure...
> Then the Parameters tab, to set the Print Precision).  This changes
> the point at which integers are displayed in exponential notation. 

Thanks--I didn't know that tidbit!  Boy, it sure would have been nice 
if there had been a "see also" cross reference to "Print Precision" in 
the x: (Extended Precision) monadic verb (and vice versa).  As noted in 
the current "j docs" thread in the Chat forum, current internal cross 
referencing leaves a lot to be desired.  (As a cataloging librarian, 
such cross referencing in a library catalog is the "bread and butter" I 
deal with on a daily basis, and so it's very frustrating at times to be 
dealing with information where such linkages are lacking.)
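
For the record, a minimal sketch of the session commands involved.  
9!:10 is the query Rob mentions; 9!:11 is its companion for setting the 
value.  The value 14 is just an example chosen to fit a 14-digit 
barcode.

```j
9!:10 ''        NB. show the current print precision
9!:11 ] 14      NB. set the print precision to 14 digits
2 ^ 44          NB. a 14-digit floating-point result; should now show in full
9!:11 ] 6       NB. put it back to the usual default of 6
```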

> Since you now indicate the data is alphanumeric, then probably best
> to keep it as character.  You can still sort the characters. 
> Here is an example... [omitted]
> The key here is that it is not clear from your 'instances' of data
> what the 'general' rules are for all your data. 

Well, let me describe the various data I'm currently working with in 
this way (these are all vectors at this point; in the future I hope to 
move on to textual arrays):

(1) Sometimes (as with my initial examples), I've already pre-massaged 
the data, creating files that have purely numeric values in them (no 
quotation marks, no recordtype prefixes).

(2) Sometimes, the "raw" exported data in the files might be the exact 
same data, except that the numeric values are enclosed within quotation 
marks (to make it easy to import into MS Excel, I presume).

(3) Sometimes, the exported data is alphanumeric (a recordtype 
alphabetic character followed by completely numeric data), enclosed 
within quotation marks (most likely for easy Excel import), unless I 
have pre-massaged the data.  This data is a bit special, because 
the numeric portions are the unique key identifiers of records in a 
database, the prefix indicating which database: b=bibliographic, 
i=item, p=patron, o=order, etc.  Since every item in the data file has 
the same prefix, it's not really necessary for work with J, and, for 
file writing purposes, it's not needed in the output file either--
that's why I asked about the possibility of getting rid of it, too.

(4) Sometimes, the exported data is purely textual, enclosed within 
quotation marks as textual delimiters (again, most likely for easy 
Excel import).

NOTE1: The numeric portions of data classes 1, 2, and 3 above can end 
with "x" (or "X") as a check digit for the accuracy of the remainder of 
the number.  This is most common for the record identifiers and for 10-
digit ISBN values on books.  ("X" stands for a remainder of 10 when 
using a MOD 11 algorithm, equivalent to "Residue" in J.)
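
A minimal sketch of that MOD 11 test for the 10-digit ISBN case only 
(I am not assuming the record identifiers use the same weights).  The 
name isbn10ok is made up; it assumes its argument is exactly 10 
characters, all digits except possibly a final x or X, and does no 
other validation.

```j
NB. isbn10ok is a hypothetical name, not a library verb.
isbn10ok =: 3 : 0
v =. 10 <. '0123456789xX' i. y     NB. digit values; x and X both become 10
0 = 11 | +/ v * 10 - i. 10         NB. weights 10 9 ... 1, then Residue
)

isbn10ok '047126847X'     NB. 1
isbn10ok '1895721156'     NB. 1
```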

NOTE2: The "raw" data files (that is, the ones I have not pre-massaged) 
also have a first item that is a textual column header, even if the 
remaining items are purely numeric (though enclosed within quotation 
marks).  This is for ease of import into Excel.

Do I understand you correctly when you say I should be able to sort, 
dedupe, and do set union, set intersection, and set exclusion if I were 
to just use character vectors/arrays all the way through?  I thought I 
had tried those set operations early on with no success.  (On the other 
hand, maybe my verb sequences weren't correct back then.)
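
Here is the kind of thing I mean, as a sketch on boxed character lists 
(the names a and b and the sample values are only for illustration):

```j
a =: '15649131' ; '1564926x' ; '047126847X' ; '15649131'
b =: '1564926x' ; '1895721156'

~. a               NB. dedupe (Nub)
/:~ a              NB. sort ascending
~. a , b           NB. set union
a -. b             NB. set exclusion: items of a that are not in b
a ([ -. -.) b      NB. set intersection: items of a that are also in b
```

Note that -. by itself does not dedupe, so a -. b still contains 
'15649131' twice; apply ~. first if a true set is wanted.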

> This whole process will be less painful if you could do the following...
> * Supply sample input file with a subset of rows that fully describe
> the data (ie 15 rows, each one an instance of all the different
> types?) 

I believe this is a list of all possible variations of the data I'm 
currently dealing with:

31184017063376  [14-digit barcode, pre-massaged]
"31184017063376"  [same as above, but with quotation marks]
1895721156  [10-digit book ISBN]
"1895721156"  [same as above, but with quotation marks]
15649131  [8-digit record identifier, pre-massaged]
b15649131  [same as above, but with recordtype prefix]
"b15649131"  [same as above, but with quotation marks]
047126847X  [10-digit book ISBN with check digit "X"]
"047126847X"  [same as above, but with quotation marks]
1564926x  [pre-massaged 8-digit record ID with check digit "x"]
b1564926x  [same as above, but with recordtype prefix]
"b1564926x"  [same as above, but with quotation marks]
"AUTHOR: Iverson, Kenneth E."  [a single item of pure textual data]

The data above that is enclosed with quotation marks also has a "non-
data" textual column header as the first datum in its file.  Here's an 
example of that datum:

"RECORD #(BIBLIO)"

> * Specify what you want to do to that data, how to handle 'b123x' etc
> * Specify how you want it written out
> Perhaps a precise solution could then be offered and you could query the
> different ways a solution is achieved.

Essentially, the data should look like the examples above without 
quotation marks and without a recordtype prefix, but containing an "x" 
check digit, if it exists.  This is the data that would be manipulated 
with set-related operations and which would be exported (written to 
disk).
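
To make the target concrete, a rough sketch of the cleanup for a single 
identifier.  The name clean is made up, and the prefix list 'bipo' only 
covers the four recordtypes named above.  It is not meant for the pure 
textual data, where a leading lowercase letter would be real content.

```j
NB. clean is a hypothetical name; y is one record as a character list.
clean =: 3 : 0
s =. y -. '"'                            NB. remove the quotation marks
if. ({. s) e. 'bipo' do. s =. }. s end.  NB. drop a leading recordtype letter
s
)

clean '"b1564926x"'                       NB. 1564926x
clean&.> '"b15649131"' ; '047126847X'     NB. apply to each box of a list
```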

Ric Sherlock wrote:
> Yes ". will convert a literal number to a numeric one, but the dyadic
> version is faster and more specific. See
> http://www.jsoftware.com/jwiki/Guides/General_FAQ/Numbers_and_Character_
> Representations

Boy, information sure is scattered all over the place, isn't it?  
Again, here is where it would be useful to have "see also" cross 
references between these various locations in the documentation.

I wrote:
> > However, apparently 'm' requires *numeric* data??  
> 
> No, 'm' fread 'c:\rfile1.txt' will work fine with literal data.

That's good to know.  For whatever reason, I was getting errors that 
made me think that it might work only with numbers.

> Boxing strings is most useful for strings of unequal lengths.

That's what I thought, and it's good to know for some future data 
endeavors I have in mind (textual arrays).

> You can sort boxed data. ...[examples omitted]
> If you want to drop the double quotes in the first and last columns you
> could do }.@:}:"1 tmp ...[example omitted]
> If your values are equal length I'd read the file into a text array
> (matrix) using
>   tmp=. 'm' fread <filename>
> If they are unequal length then a better option would be to read the
> file into a boxed list using
>   tmp=. 'b' fread <filename>
> After reading into a noun using either of the above methods, you can
> drop the first one using }.
> You could test to see if there is a column header as the first record
> and only drop it if it is, for example:
>    tmp=. (+./'"RECORD' E. {.tmp)}.tmp  NB. use with array
>  or
>    tmp=. (+./'"RECORD' E. 0{::tmp)}.tmp NB. use with boxed list

Wow!  Great information!!  Thanks!
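
Putting those pieces together on some inline sample data, as a sketch 
(no file is read here; tmp stands in for what 'b' fread would return, 
and hdr is a name I made up):

```j
tmp =: '"RECORD #(BIBLIO)"' ; '"b15649131"' ; '"b1564926x"'

hdr =: +./ '"RECORD' E. 0 {:: tmp    NB. 1 if the first item contains "RECORD
tmp =: hdr }. tmp                    NB. drops 1 item if hdr is 1, else none
(}.@}:)&.> tmp                       NB. strip first and last char of each item
```

Inside each box this leaves b15649131 and b1564926x, still with the 
recordtype prefix.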

> One of the things that is nice about J is that many primitives will
> work with arrays whether they are numeric or literal. 

That's what I figured, but I just was having a darned hard time trying 
to find information about textual vectors/arrays.

Again, thanks to you all for giving me so many leads to work with!  All 
of this information has been *SO* helpful!!

Harvey

----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
