Dan Bron wrote:
|I'm going to try to rephrase your question.  Let me know if I've got it
|right.

Close, but no cigar--see below.

|Your input is a several tables of ID,Value pairs and one 
|canonical list of all IDs.

That's correct, although the first part of the input you note could just
as typically be one table only.  Additionally, as I note in my next
response below, the "canonical list" could also be a full table with an
ID column and multiple data value columns.

|Your desired output is another ID,Value table, whose first
|column is identical to the input list of all IDs, and whose last column
|has the corresponding values from the input tables (or zero, 
|if no value corresponded to the ID).

This is not what I'm looking for.  Rather, the "value" part of a new
"ID,value" table (or multiple such tables, if you desire) is appended as
an ADDITIONAL column onto the existing table, but only when the "ID"
part matches the first ("key") column; the value associated with
nonmatches is zero (or some other designated value, depending on the
situation).  In other words, the original table is an Nx1 table of ID's
only, or it could be an Nx(D+1) table, where D is the number of values:
"ID,val1,val2,val3,etc."  Basically, an existing table "ID,val1,val2"
would become "ID,val1,val2,val3" after the joining operation.  Likewise,
the simpler case "ID" would become "ID,value" after the joining.  (See
the end of this message for some specific examples.)
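For concreteness, here is one way that joining might be written as an explicit verb (a sketch only, not tested against your data; `joincol` is a name I made up, and I assume boxed tables whose first column is the key):

```
NB. joincol: append one value column to a keyed table (sketch)
NB. x: existing boxed table whose first column is the key (Nx1 or NxK)
NB. y: two-column boxed ID,value table
joincol =: dyad define
  keys =. {."1 x            NB. key column of the existing table
  ids  =. {."1 y            NB. ID's present in the new data
  vals =. {:"1 y            NB. their corresponding values
  ix   =. ids i. keys       NB. row of y for each key; #ids when not found
  x ,. (vals , <'0') {~ ix  NB. out-of-range index selects the '0' fill box
)
```

With such a verb, `tblall joincol tbl1` would yield an "ID,val1" table, and `(tblall joincol tbl1) joincol tbl2` would yield "ID,val1,val2", with each input's values in their own column and no spurious zero column.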

|You prefer a multiline explicit verb solution, because at this point in
|your J career, that will be easier for you to read, maintain, and
|extend.

Most definitely!!  I have to say, though, that after struggling with J
programming for a year or so now, I can understand and create J programs
that I was not able to a year ago--so I'm making progress little by
little.  To a small extent, I'm starting to "think in J", but there's a
LONG way to go yet.  The thing I fail to think of most often (due to
using procedural programming languages since the 1970's) is "'doing it
all at once' [as] a common pattern in J", as you note below.

|Assuming the above, then does the following suit?

Partially yes, partially no.

|... you treat  tblall  as one-column table, and append a
|column of zeros, 

I needed to do the zero-column thing (you may recall my earlier question
about how to do that) in order to make Oleg's original tacit expression
work correctly--that is, placeholder values needed to exist in the table
for keys that didn't match any ID, so that they could be overwritten
wherever a match occurred.  More recently, as I had more time and
opportunity again to devote to this particular application, Oleg's
expression started failing, and so I requested an explicit definition
(as Henry Rich offered) so that I could better track down why the
failures were occurring.  (I'm not very good yet at understanding tacit
expressions, although Henry's rendition made Oleg's original code much
more understandable to me.)

|then merge the values from the other tables into that column.

No--instead, merge them into SEPARATE columns, not all into the SAME
column (although that's interesting in itself, and I'll have to keep
your technique on hand for possible future use).

|I find } is relatively uncommon in J code.  While it has elegant 
|uses, most of the familiar ones from scalar languages are expressed 
|differently in J (to our benefit).  

My usage merely follows that of the "masters"; the few times I've used
it in verbs of my own have usually been under advisement from members of
the forum.  How would one "express it differently in J", in terms of
selectively modifying data in a table?

|Also, you merge multiple tables in multiple steps, whereas the above 
|code treats the aggregation of all ID,Value tables as a single input 
|(the tables agree in type and shape, and that fact can be leveraged).
|"Doing it all at once" is a common pattern in J.

The application I'm working on at the moment happens to use only two
tables of values output from our library database--but the values need
to be in SEPARATE columns because they represent two different kinds of
values.

|Note that i. will give preference to the values in tbl1 .  If you want
|to give preference to tbl2 then put it first (i.e. use tbl2,tbl1 ).

The data in tbl1 and tbl2 are different (in a different context, it's
like March data is different from April data, even though they're both
numbers or money), and so the values in these two tables CANNOT be
merged into a single value column.

Here is what the IDEAL end result of this joining should look like:
+---------+----------+----------+
|b18934225|08-28-2008|12-17-2008|
+---------+----------+----------+
|b18934286|08-28-2008|0         |
+---------+----------+----------+
|b18934304|0         |10-21-2008|
+---------+----------+----------+
|b18934468|09-01-2008|0         |
+---------+----------+----------+
|b18935618|0         |0         |
+---------+----------+----------+
|b1893741x|10-14-2008|11-25-2008|
+---------+----------+----------+

Running your verb resulted in this, where all data is merged in column
1:
+---------+----------+
|b18934225|08-28-2008|
+---------+----------+
|b18934286|08-28-2008|
+---------+----------+
|b18934304|10-21-2008|
+---------+----------+
|b18934468|09-01-2008|
+---------+----------+
|b18935618|0         |
+---------+----------+
|b1893741x|10-14-2008|
+---------+----------+

The end result of running your verb in two steps, "tbla=. tblall join
tbl1" and then "tblb=. tbla join tbl2", was closer to my intent but had
an extra column of zeros:
+---------+----------+----------+-+
|b18934225|08-28-2008|12-17-2008|0|
+---------+----------+----------+-+
|b18934286|08-28-2008|0         |0|
+---------+----------+----------+-+
|b18934304|0         |10-21-2008|0|
+---------+----------+----------+-+
|b18934468|09-01-2008|0         |0|
+---------+----------+----------+-+
|b18935618|0         |0         |0|
+---------+----------+----------+-+
|b1893741x|10-14-2008|11-25-2008|0|
+---------+----------+----------+-+

And, if I run this code ("tblc=. tblb join tbl1") a third time (to add a
third column of values), things start going really awry:
+---------+----------+----------+-+----------+-+-+-+
|b18934225|08-28-2008|12-17-2008|0|08-28-2008|0|0|0|
+---------+----------+----------+-+----------+-+-+-+
|b18934286|08-28-2008|0         |0|08-28-2008|0|0|0|
+---------+----------+----------+-+----------+-+-+-+
|b18934304|0         |10-21-2008|0|0         |0|0|0|
+---------+----------+----------+-+----------+-+-+-+
|b18934468|09-01-2008|0         |0|09-01-2008|0|0|0|
+---------+----------+----------+-+----------+-+-+-+
|b18935618|0         |0         |0|0         |0|0|0|
+---------+----------+----------+-+----------+-+-+-+
|b1893741x|10-14-2008|11-25-2008|0|10-14-2008|0|0|0|
+---------+----------+----------+-+----------+-+-+-+

I am currently using (and needing) table update methodology for three
applications:

(1) The original need was for a cumulative report of monthly statistical
data, where (due to zero suppression in our database output reports,
over which we have no control) each month's output data has a different
subset of keys/ID's.  Since we know the exact full set of ID's, a table
of those can be the starting point and "appended to" each month:
    "ID" + "Jan" --> "ID,Jan"
    "ID,Jan" + "Feb" --> "ID,Jan,Feb"
    "ID,Jan,Feb" + "Mar" --> "ID,Jan,Feb,Mar"
    etc.

(2) We would like to see how user borrowing changes from year to year
(or even month to month, at least for testing purposes).  This requires
a table of IDs that is continually increasing over time as new people
move into the community and get library cards.  Our patron database
tracks the grand total of how many things each person has borrowed since
they got their library card.  By creating a "snapshot" report once a
year (or once a month) of all registered borrowers, I can get a column
of data with the "ID,value" pattern that is matched against the "master"
ID column of all borrowers of all time.  The "appending" of this new
column for a year (or month) is the same as above.  The new challenge in
this application is avoiding "key not found" errors from the ID's of
borrowers registered since the last snapshot: the "master" list of ID's
and the latest list of ID's (containing some new ID's) have to be
merged, deduped (via "nub"), and then given columns of zeros to "join"
the new set of values with (as in #1 above).
Once all this is in place, then scans can be created which subtract each
pair of columns to find out how much activity occurred each year (or
month) for each ID.  That statistical data can then be manipulated to
one's heart's content for data mining purposes (for example, using both
horizontal and vertical sums).
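The mechanics described in (2) might be sketched like this, with made-up data (`master`, `snap`, and `cum` are names I invented for illustration):

```
NB. grow the master ID list with any new ID's, deduped via nub (~.)
master =. 'a100';'a200';'a300'              NB. all borrowers seen so far
snap   =. 3 2 $ 'a200';7;'a400';2;'a100';9  NB. latest ID,total snapshot
master =. ~. master , {."1 snap             NB. now includes 'a400'

NB. once joined, cumulative totals form one numeric column per snapshot;
NB. per-period activity is the difference of each adjacent pair of columns
cum =. 3 3 $ 0 4 9  0 7 7  0 0 2            NB. rows: ID's, columns: years
2 -~/\"1 cum                                NB. -> 3x2 table: 4 5, 7 0, 0 2
NB. horizontal and vertical sums: +/"1 cum and +/ cum
```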

(3) The third application is the one that generated my current set of
questions.  The question comes up periodically as to how long it takes
for materials to go through the cataloging area of the Technical
Services Department (or, phrased differently, how long are materials
waiting on shelves before they become available to the public).  When
you're dealing with 50,000 items annually, it can be an important
question.  (Cataloging is the area that describes materials in detail,
classifies and categorizes them, and puts them into the online library
catalog.)  Unlike both previous applications, this has a very limited
number of columns.  However, unlike the first but like the second
application above, there can be (if the study is carried on over time)
an increasing number of ID's as new materials are purchased.  There are
only four columns: ID, date of receipt, date of cataloging, number of
days difference between the two dates.  There are two inputs (that is,
output reports from our database): "ID,receiptdate" and
"ID,catalogingdate".  By merging and deduping (via "nub"), I created the
table of ID's that is the starting point for the joining of tables that
initiated this current series of questions.  (As I noted, in order to
avoid any "key not found" errors in Oleg's tacit expression, the ID
table must contain all possible ID's to begin with.)  Then I'm trying to
create the second and third columns of data, which is where the rub has
come.  Finally (I've not yet attempted this, because the data isn't yet
in the second and third columns), I need to convert the dates to ordinal
values and subtract them in order to fill the fourth column.
That column will then become the basis for statistical analysis and
graphing, perhaps even incorporating how things may have changed over
time.
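For that fourth column, the standard library's `dates` script may help: `todayno` converts a yyyy mm dd date to a day number, so subtracting two day numbers gives elapsed days.  A sketch (assuming well-formed mm-dd-yyyy strings; the '0' fill rows would need to be screened out first):

```
require 'strings dates'   NB. rplc and todayno from the standard library
NB. parse an mm-dd-yyyy string and convert it to a day number
days =: 3 : 'todayno 2 0 1 { _ ". y rplc ''-'';'' '''
(days '12-17-2008') - days '08-28-2008'   NB. days from receipt to cataloging
```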

I hope all of this explains in far greater detail exactly WHAT I'm
trying to accomplish and HOW I've tried to accomplish it so far.
(And, in between all of this, I've been honing J skills by "playing"
with stock market data to accomplish various things, like creating code
to automatically identify historical highs and lows in data series based
either on time span or on percentage change.  I've also tried creating
some of my own indicators, but that's a bit more challenging.  Also,
once I retire next summer, I hope to get back to applying J to
computer-assisted musical composition.)

With all of this additional information, can anyone show the path to the
solution of what I'm trying to accomplish in the three applications
above?  Thanks, Dan, for your help so far!

Harvey
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
