Ok... I'd do something like this:

NB. stand-ins -- please replace definitions
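isLabel=: 'LA' e.~ {.            NB. 1 if the line starts with L or A
isFirst=: 'Attribute1' -: -.&' ' NB. 1 if the line is 'Attribute 1', ignoring blanks
processBatch=: <                 NB. stand-in: box each batch (entry)
processItem=: <                  NB. stand-in: box each item (label plus its values)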

NB. utility: shift left by x, filling vacated positions with 0
bitshift=: |.!.0

   ((1 bitshift isFirst&>) processBatch@:(processItem/.~ [: +/\ isLabel&>);.1 ]) <;._2 d1
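
To see what each piece of that phrase contributes, here is a small sketch that evaluates the intermediate results one at a time. The sample list "lines" is made up for illustration (it is not from the report), and the names marks, frets, itemKeys and r are only there for the demonstration. The definitions are repeated so the block runs on its own.

```j
NB. definitions repeated from above so this runs on its own
isLabel=: 'LA' e.~ {.
isFirst=: 'Attribute1' -: -.&' '
processBatch=: <
processItem=: <
bitshift=: |.!.0

NB. hypothetical sample: one junk line, then two short entries
lines=: 'junk';'alpha';'Attribute 1';'bravo';'india';'Attribute 1';'juliet'

NB. 1 wherever a line is 'Attribute 1' once blanks are removed
marks=: isFirst&> lines             NB. 0 0 1 0 0 1 0

NB. shift left by one with zero fill: marks the entry-name lines
frets=: 1 bitshift marks            NB. 0 1 0 0 1 0 0

NB. inside one batch, a running count of label lines keys each item
itemKeys=: +/\ isLabel&> 'alpha';'Attribute 1';'bravo'   NB. 0 1 1

NB. the full phrase: cut at the frets, then key within each batch
r=: ((1 bitshift isFirst&>) processBatch@:(processItem/.~ [: +/\ isLabel&>);.1 ]) lines
# r                                 NB. 2 batches
```

Because ;.1 starts each interval at a fret, anything before the first fret (here 'junk') is dropped, which should also take care of the leading lines ahead of the first Attribute 0 value.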

Note that this gives three levels of structure: batches, items, and
the original lines.  I'm using boxing so that padding needed at one
level of structure does not leak into a different level.  Based on
your description, I believe you will want to keep this structure; I
do not know whether you would also want to raze the result.
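
As a rough illustration of those three levels, the sketch below runs the phrase on the d1 sample from the quoted message (with the stand-in definitions, so every level is simply boxed) and counts what sits at each level. The name r is only for the demonstration.

```j
NB. stand-ins and utility, repeated so this runs on its own
isLabel=: 'LA' e.~ {.
isFirst=: 'Attribute1' -: -.&' '
processBatch=: <
processItem=: <
bitshift=: |.!.0

NB. sample data copied from the quoted message
d1=: 0 : 0
alpha
Attribute 1
bravo
Attribute 2
charlie
delta
Location
echo
foxtrot
golf
Attribute 3
hotel
india
Attribute 1
juliet
Attribute 2
kilo
Location
lima
)

r=: ((1 bitshift isFirst&>) processBatch@:(processItem/.~ [: +/\ isLabel&>);.1 ]) <;._2 d1

# r                      NB. 2       batches (entries)
#&> r                    NB. 5 4     items in each batch
#&> > {. r               NB. 1 2 3 4 2   lines in each item of the first batch
# ; r                    NB. 9       items once the batch level is razed
(; ; r) -: <;._2 d1      NB. 1       razing twice recovers the original lines
```

Each raze removes one level of boxing, so with the stand-ins nothing is lost by keeping the structure until the real processItem and processBatch decide what to do with it.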

FYI,

-- 
Raul

On Tue, Oct 23, 2012 at 4:10 PM, Bill Harris
<bill_har...@facilitatedsystems.com> wrote:
> I get to J little enough these days so I'm a bit rusty when it comes
> to the interesting stuff, and I'm stuck on a particular problem.
>
> I start with a PDF report.  I run it through pdftotext and then
> format/zulu's a2b to get a file that is mostly of the form
>
> value
> attribute
> value
> attribute
> value
> .
> .
> .
> value
> value
> attribute
> value
> .
> .
> .
>
> The first value of each entry has no explicit attribute name, although
> "entry name" would be a suitable attribute name.  Some attributes span
> multiple rows, and attributes may be of any reasonable length and do
> include whitespace.  I know the set of attribute names, and some
> include whitespace, too.  Some entries don't use all attributes.
>
> There's one other complication: one attribute (call it 'location,' if
> you will) has multiple rows that indicate multiple locations.  I need
> to duplicate the full entry for each location listed in that entry.
>
> For others' use, I want to output a csv file that has one entry per
> row and each attribute in a separate column, with empty cells where
> the attribute wasn't used.  I can then sort, search, and aggregate
> inside J, as I wish, to process further myself.
>
> Here's an example bit of data:
>
> d1=: 0 : 0
> alpha
> Attribute 1
> bravo
> Attribute 2
> charlie
> delta
> Location
> echo
> foxtrot
> golf
> Attribute 3
> hotel
> india
> Attribute 1
> juliet
> Attribute 2
> kilo
> Location
> lima
> )
>
> Here's what I think I want it to look like at an intermediate step:
>
> d2 =: 0 : 0
> Attribute 0: alpha
> Attribute 1: bravo
> Attribute 2: charlie delta
> Location: echo
> Attribute 3: hotel
> Attribute 0: alpha
> Attribute 1: bravo
> Attribute 2: charlie delta
> Location: foxtrot
> Attribute 3: hotel
> Attribute 0: alpha
> Attribute 1: bravo
> Attribute 2: charlie delta
> Location: golf
> Attribute 3: hotel
> Attribute 0: india
> Attribute 1: juliet
> Attribute 2: kilo
> Location: lima
> Attribute 3:
> )
>
> Attribute 0 is always a one-liner, so I detect its value by backing up
> one from 'Attribute 1'.  (I didn't pick the file format. :-) )
>
> There are about 20-40 lines at the start that I need to
> drop--everything before the first instance of a value for Attribute 0.
>
> The final result, ready for analysis, would look something like
>
> d3 =: 4 5 $  <;._2 d2
>
> Better, it would look like that with everything up to and including
> the first ':' elided (the value entries can include multiple colons)
> and with the attributes as a header row.  I can manage the header, and
> I'm pretty sure I can manage stripping out attribute names.
>
> I've looked at JfC chapter 23 as a potentially useful spot, but I
> haven't yet seen the light.  Suggestions of fruitful paths forward?
>
> Thanks,
>
> Bill
> --
> Bill Harris
> http://facilitatedsystems.com/weblog/
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm