Ok... I'd do something like this:

NB. stand-ins -- please replace definitions
isLabel=: 'LA' e.~ {.
isFirst=: 'Attribute1' -: -.&' '
processBatch=: <
processItem=: <

NB. utility
bitshift=: |.!.0

((1 bitshift isFirst&>) processBatch@:(processItem/.~ [: +/\ isLabel&> );.1 ]) <;._2 d1

Note that this gives three levels of structure: batches, items, and
the original lines. I'm using boxing so that padding needed at one
level of structure does not force padding at a different level. Based
on your description, I believe that you would want to keep this, and I
do not know if you would want to raze the result.

FYI,

--
Raul

On Tue, Oct 23, 2012 at 4:10 PM, Bill Harris
<bill_har...@facilitatedsystems.com> wrote:
> I get to J little enough these days so I'm a bit rusty when it comes
> to the interesting stuff, and I'm stuck on a particular problem.
>
> I start with a PDF report. I run it through pdftotext and then
> format/zulu's a2b to get a file that is mostly of the form
>
> value
> attribute
> value
> attribute
> value
> .
> .
> .
> value
> value
> attribute
> value
> .
> .
> .
>
> The first value of each entry has no explicit attribute name, although
> "entry name" would be a suitable attribute name. Some attributes span
> multiple rows, and attributes may be of any reasonable length and do
> include whitespace. I know the set of attribute names, and some
> include whitespace, too. Some entries don't use all attributes.
>
> There's one other complication: one attribute (call it 'location,' if
> you will) has multiple rows that indicate multiple locations. I need
> to duplicate the full entry for each location listed in that entry.
>
> For others' use, I want to output a csv file that has one entry per
> row and each attribute in a separate column, with empty cells where
> the attribute wasn't used. I can then sort, search, and aggregate
> inside J, as I wish, to process further myself.
>
> Here's an example bit of data:
>
> d1=: 0 : 0
> alpha
> Attribute 1
> bravo
> Attribute 2
> charlie
> delta
> Location
> echo
> foxtrot
> golf
> Attribute 3
> hotel
> india
> Attribute 1
> juliet
> Attribute 2
> kilo
> Location
> lima
> )
>
> Here's what I think I want it to look like at an intermediate step:
>
> d2 =: 0 : 0
> Attribute 0: alpha
> Attribute 1: bravo
> Attribute 2: charlie delta
> Location: echo
> Attribute 3: hotel
> Attribute 0: alpha
> Attribute 1: bravo
> Attribute 2: charlie delta
> Location: foxtrot
> Attribute 3: hotel
> Attribute 0: alpha
> Attribute 1: bravo
> Attribute 2: charlie delta
> Location: golf
> Attribute 3: hotel
> Attribute 0: india
> Attribute 1: juliet
> Attribute 2: kilo
> Location: lima
> Attribute 3:
> )
>
> Attribute 0 is always a one-liner, so I detect its value by backing up
> one from 'Attribute 1'. (I didn't pick the file format. :-) )
>
> There are about 20-40 lines at the start that I need to
> drop--everything before the first instance of a value for Attribute 0.
>
> The final result, ready for analysis, would look something like
>
> d3 =: 4 5 $ <;._2 d2
>
> Better, it would look like that with everything up to and including
> the first ':' elided (the value entries can include multiple colons)
> and with the attributes as a header row. I can manage the header, and
> I'm pretty sure I can manage stripping out attribute names.
>
> I've looked at JfC chapter 23 as a potentially useful spot, but I
> haven't yet seen the light. Suggestions of fruitful paths forward?
>
> Thanks,
>
> Bill
> --
> Bill Harris
> http://facilitatedsystems.com/weblog/
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm
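
For anyone following along, here is a rough sketch of how the intermediate values of the expression in the reply could be inspected one step at a time. It assumes d1 from the quoted message (with the "> " quoting marks stripped) and the stand-in definitions isLabel, isFirst, processBatch, processItem and bitshift are loaded exactly as given in this thread. The names lines, frets and r are introduced here purely for illustration. Input is indented three spaces; output is flush left.

```j
NB. assumes d1, isLabel, isFirst, processBatch, processItem and bitshift
NB. are defined as in this thread; lines, frets and r are hypothetical names
   lines=: <;._2 d1                      NB. one box per line of d1
   frets=: 1 bitshift isFirst&> lines    NB. 1 on the line just before each 'Attribute 1'
   frets
1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
   +/\ isLabel&> 12 {. lines             NB. item keys within the first batch
0 1 1 2 2 2 3 3 3 3 4 4
   r=: frets processBatch@:(processItem/.~ [: +/\ isLabel&>);.1 lines
   # r                                   NB. number of batches
2
   #&> r                                 NB. items in each batch
5 4
```

The second batch holds only 4 items because the 'india' entry has no 'Attribute 3' line, so a real processBatch would presumably need to supply the empty cell for it.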