I get to use J rarely enough these days that I'm a bit rusty when it
comes to the interesting stuff, and I'm stuck on a particular problem.

I start with a PDF report.  I run it through pdftotext and then
format/zulu's a2b to get a file that is mostly of the form

value
attribute
value
attribute
value
.
.
.
value
value
attribute
value
.
.
.

The first value of each entry has no explicit attribute name, although
"entry name" would be a suitable one.  Some attribute values span
multiple rows, and values may be of any reasonable length and do
include whitespace.  I know the full set of attribute names, and some of
those include whitespace, too.  Some entries don't use all attributes.

There's one other complication: one attribute (call it 'Location', if
you will) can have multiple rows, each naming a separate location.  I
need to duplicate the full entry for each location listed in that entry.

For others' use, I want to output a CSV file that has one entry per
row and each attribute in a separate column, with empty cells where
the attribute wasn't used.  I can then sort, search, and aggregate
inside J, as I wish, to process further myself.

Here's an example bit of data:

d1=: 0 : 0
alpha
Attribute 1
bravo
Attribute 2
charlie
delta
Location
echo
foxtrot
golf
Attribute 3
hotel
india
Attribute 1
juliet
Attribute 2
kilo
Location
lima
)

Here's what I think I want it to look like at an intermediate step:

d2 =: 0 : 0
Attribute 0: alpha
Attribute 1: bravo
Attribute 2: charlie delta
Location: echo
Attribute 3: hotel
Attribute 0: alpha
Attribute 1: bravo
Attribute 2: charlie delta
Location: foxtrot
Attribute 3: hotel
Attribute 0: alpha
Attribute 1: bravo
Attribute 2: charlie delta
Location: golf
Attribute 3: hotel
Attribute 0: india
Attribute 1: juliet
Attribute 2: kilo
Location: lima
Attribute 3:
)

Attribute 0 is always a one-liner, so I detect its value by backing up
one from 'Attribute 1'.  (I didn't pick the file format. :-) )
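
To pin down the logic described so far, here is a rough procedural
sketch.  It is Python rather than J, so it only specifies the intended
behavior and is not the array-style solution being asked for.  The names
parse_entries, expand, and to_lines are made up for illustration, and the
sketch assumes that no value row is ever identical to an attribute name.

```python
# Sample data copied from d1 in the post.
d1 = """\
alpha
Attribute 1
bravo
Attribute 2
charlie
delta
Location
echo
foxtrot
golf
Attribute 3
hotel
india
Attribute 1
juliet
Attribute 2
kilo
Location
lima
"""

ATTRS = ["Attribute 1", "Attribute 2", "Location", "Attribute 3"]
FIRST_ATTR = "Attribute 1"  # the entry name is the line just before this
MULTI = "Location"          # each value row of this attribute gets its own record
COLUMNS = ["Attribute 0"] + ATTRS


def parse_entries(lines):
    """Group raw lines into dicts mapping attribute name -> list of value rows."""
    entries, current, attr = [], None, None
    for i, line in enumerate(lines):
        nxt = lines[i + 1] if i + 1 < len(lines) else None
        if nxt == FIRST_ATTR:
            # A line followed by FIRST_ATTR is a one-line entry name.
            current = {"Attribute 0": [line]}
            entries.append(current)
            attr = None
        elif current is None:
            continue  # still in the preamble before the first entry
        elif line in ATTRS:
            attr = line
            current[attr] = []
        elif attr is not None:
            current[attr].append(line)
    return entries


def expand(entry):
    """Return one flat record per Location row; other multi-row values are joined."""
    flat = {k: " ".join(v) for k, v in entry.items() if k != MULTI}
    return [{**flat, MULTI: loc} for loc in (entry.get(MULTI) or [""])]


def to_lines(record):
    """Render a record in the 'name: value' form of d2; unused attributes stay empty."""
    return [f"{c}: {record.get(c, '')}".rstrip() for c in COLUMNS]


records = [r for e in parse_entries(d1.splitlines()) for r in expand(e)]
for r in records:
    print("\n".join(to_lines(r)))
```

Run on d1, this yields four records (three for alpha, one for india)
in the same layout as d2.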

There are about 20-40 lines at the start that I need to
drop--everything before the first instance of a value for Attribute 0.
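
Dropping that preamble can reuse the same "back up one from
'Attribute 1'" rule.  Below is a minimal sketch in Python (again, not
J).  The preamble lines in raw are invented sample data, drop_preamble
is a hypothetical name, and the sketch assumes the preamble itself never
contains a line reading exactly 'Attribute 1'.

```python
FIRST_ATTR = "Attribute 1"  # assumed name of the first labeled attribute


def drop_preamble(lines):
    """Discard everything before the first entry name (the line preceding FIRST_ATTR)."""
    for i in range(1, len(lines)):
        if lines[i] == FIRST_ATTR:
            return lines[i - 1:]
    return []  # no entry found at all


# Made-up preamble followed by the start of the real data.
raw = ["Report title", "Page 1 of 3", "", "alpha", "Attribute 1", "bravo"]
body = drop_preamble(raw)
print(body)
```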

The final result, ready for analysis, would look something like

d3 =: 4 5 $  <;._2 d2

Better, it would look like that with everything up to and including
the first ':' elided (the value entries can include multiple colons)
and with the attributes as a header row.  I can manage the header, and
I'm pretty sure I can manage stripping out attribute names.
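
For concreteness, the "strip through the first ':' and add a header row"
step might look like the following Python sketch (not J).  The d2 here is
shortened to two records, value_of is a hypothetical helper, and the
sketch assumes that every record carries exactly one line per column,
in column order, as d2 does.

```python
import csv
import io

# Shortened version of d2 from the post: two records of five lines each.
d2 = """\
Attribute 0: alpha
Attribute 1: bravo
Attribute 2: charlie delta
Location: echo
Attribute 3: hotel
Attribute 0: india
Attribute 1: juliet
Attribute 2: kilo
Location: lima
Attribute 3:
"""

COLUMNS = ["Attribute 0", "Attribute 1", "Attribute 2", "Location", "Attribute 3"]


def value_of(line):
    """Drop everything up to and including the first ':'; later colons are kept."""
    return line.partition(":")[2].strip()


values = [value_of(line) for line in d2.splitlines()]

# Reshape the flat list into rows of len(COLUMNS) cells, like 4 5 $ in the post.
width = len(COLUMNS)
rows = [values[i:i + width] for i in range(0, len(values), width)]

# Write a header row followed by the data; empty values become empty cells.
buf = io.StringIO()
writer = csv.writer(buf, lineterminator="\n")
writer.writerow(COLUMNS)
writer.writerows(rows)
csv_text = buf.getvalue()
print(csv_text)
```

Splitting on only the first ':' is what lets a value such as
"10:30: dock" come through intact.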

I've looked at JfC chapter 23 as a potentially useful spot, but I
haven't yet seen the light.  Suggestions of fruitful paths forward?

Thanks,

Bill
-- 
Bill Harris
http://facilitatedsystems.com/weblog/