Re: [Jprogramming] "Segmented Strings"

Raul Miller Tue, 08 Apr 2014 08:34:32 -0700

I have thought about using symbols, but the only way to delete symbols that
I know of involves exiting J. And, my starting premise was that I would
have too much data to fit into memory.


For some computations it does make sense to start up an independent J
session for each part of the calculation (and, in fact, that is what I am
doing in a different aspect of dealing with this dataset - it's about 10
terabytes, or so I am told - I've not actually seen it all yet and it takes
time to upload it). But for some calculations you need to be able to
correlate between pieces which have been dealt with elsewhere.

I have similar reservations about fixed-width fields. There's just too much
data for me to predict how wide the fields are going to be. In some cases I
might actually be going with fixed-width, but that might be too inefficient
for the general case. I've one field which would have to be over 100k in
width if it was fixed width, even though typical cases are shorter than 1k.
At some point I might go with fixed width, and I expect that doing so will
cause me to lose a few records which will be discovered later in
processing. That might not be a big deal for a data set this large, but
if it's not necessary why bother?

Finally, Bjorn's suggestion of using mapped files does seem like a good
idea, at least for the character data. But that is an optimization, and
optimizations speed up some operations at the expense of slowing down other
operations. So what really matters is the workload.

Ultimately, for a dataset this large, it's going to take time.

Thanks,

-- 
Raul




On Tue, Apr 8, 2014 at 6:06 AM, Joe Bogner <[email protected]> wrote:

> It seems this representation is somewhat similar to how the symbol table
> stores strings:
>
> http://m.jsoftware.com/help/dictionary/dsco.htm
>
> Also, did you consider using symbols? I've used symbols for string columns
> that contain highly repetitive data, for example, an invoice table with an
> alpha-numeric SKU.
>
> Thanks for sharing
>
>
>
>
>
>
> On Tue, Apr 8, 2014 at 2:40 AM, Raul Miller <[email protected]> wrote:
>
> > Consider this example:
> >
> > table=:<;._2;._2]0 :0
> > First Name,Last Name,Sum,
> > Adam,Wallace,19,
> > Travis,Smith,10,
> > Donald,Barnell,8,
> > Gary,Wallace,27,
> > James,Smith,10,
> > Sam,Johnson,10,
> > Travis,Neal,11,
> > Adam,Campbell,11,
> > Walter,Abbott,13,
> > )
> >
> > Using boxed strings works great for relatively small sets of data. But
> > when things get big, their overhead starts to hurt too much. (Big means:
> > so much data that you'll probably not be able to fit it all in memory at
> > the same time. So you need to plan on relatively frequent delays while
> > reading from disk.)
> >
> > One alternative to boxed strings is segmented strings. A segmented string
> > is an argument which could be passed to <;._1. It's basically just a
> > string with a prefix delimiter. You can work with these sorts of strings
> > directly, and achieve results similar to what you would achieve with
> > boxed arrays.
> >
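> > As a minimal sketch of that definition (seg is a made-up name and the
> > comma delimiter is only chosen so the string prints on one line; neither
> > is part of the dataset discussed here):
> >
> >    seg=: ',Adam,Travis,Donald'   NB. first character is the delimiter
> >    <;._1 seg                     NB. boxes 'Adam', 'Travis', 'Donald'
> >    #;._1 seg                     NB. segment lengths, delimiter excluded
> > 4 6 6
> >    seg -: ;','&,each <;._1 seg   NB. boxing and re-joining round-trips
> > 1
> >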
> > Segmented strings are a bit clumsier than boxed arrays - you lose a lot
> > of the integrity checks, so if you mess up you probably will not see an
> > error. So it's probably a good idea to model your code using boxed arrays
> > on a small set of data and then convert to segmented representation once
> > you're happy with how things work (and once you see a time cost that
> > makes it worth spending the time to rework your code).
> >
> > Also, to avoid having to use f;._2 (or whatever) every time, it's good to
> > do an initial pass on the data, to extract its structure.
> >
> > Here's an example:
> >
> > FirstName=:;LF&,each }.0{"1 table
> >
> > LastName=:;LF&,each }.1{"1 table
> >
> > Sum=:;LF&,each }.2{"1 table
> >
> >
> > ssdir=: [:(}:,:2-~/\])I.@(= {.),#
> >
> > FirstNameDir=: ssdir FirstName
> > LastNameDir=: ssdir LastName
> >
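> > To make the directory layout concrete, here is a sketch of ssdir applied
> > to a made-up comma-delimited string (seg and dir are hypothetical names).
> > Row 0 holds the offset of each delimiter and row 1 holds each segment's
> > length, counting its leading delimiter:
> >
> >    seg=: ',Adam,Travis,Donald'
> >    dir=: ssdir seg
> >    dir
> > 0 5 12
> > 5 7  7
> >    seg {~ (+ i.)/ 1 {"1 dir      NB. segment 1, delimiter included
> > ,Travis
> >    }. seg {~ (+ i.)/ 1 {"1 dir   NB. drop the delimiter
> > Travis
> >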
> > Actually, Sum is numeric, so let's just use a numeric representation for
> > that column:
> >
> > Sum=: _&".@> }.2{"1 table
> >
> > Which rows have a last name of Smith?
> >
> >    <:({.LastNameDir) I. I.'Smith' E. LastName
> >
> > 1 4
> >
> >
> > Actually, there's an assumption there that Smith is not part of some
> > larger name. We can include the delimiter in the search if we are
> > concerned about that. For even more protection we could append a trailing
> > delimiter on our segmented string and then search for (in this case)
> > LF,'Smith',LF.
> >
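> > A sketch of that stricter search, assuming LastName and LastNameDir as
> > defined above. Because the pattern now starts with the delimiter, each
> > match lands exactly on a segment start, so a plain lookup (i.) in the
> > directory's first row can replace the interval index and decrement:
> >
> >    ({.LastNameDir) i. I. (LF,'Smith',LF) E. LastName,LF
> > 1 4
> >
> > The trailing LF is appended only for the search, so the last record can
> > still match; LastName itself is left unchanged.
> >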
> >
> > Anyways, let's extract the corresponding sums and first name:
> >
> >
> >    1 4{Sum
> >
> > 10 10
> >
> >
> >    FirstName{~;<@(+ i.)/"1|:1 4 {"1 FirstNameDir
> >
> >
> > Travis
> >
> > James
> >
> >
> > Note that that last expression is a bit complicated. It's not so bad,
> > though, if what you are extracting is a small part of the total. And, in
> > that case, using a list of indices to express a boolean result seems like
> > a good thing. You wind up working with set operations (intersection and
> > union) rather than logical operations (and and or). Also, set difference
> > instead of logical not (dyadic -. instead of monadic -.).
> >
> >
> > intersect=: [ -. -.
> >
> > union=: ~.@,
> >
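> > For illustration, with made-up index lists standing in for query results
> > (say 1 4 for one condition, and 1 4 5 or 1 6 for another), these behave
> > like so:
> >
> >    1 4 intersect 1 4 5    NB. rows satisfying both conditions
> > 1 4
> >    1 4 union 1 6          NB. rows satisfying either condition
> > 1 4 6
> >    1 4 5 -. 1 4           NB. rows satisfying the first but not the second
> > 5
> >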
> >
> > (It looks like I might be using this kind of thing really soon, so I
> > thought I'd lay down my thoughts here and invite comment.)
> >
> >
> > Thanks,
> >
> >
> > --
> >
> > Raul
> > ----------------------------------------------------------------------
> > For information about J forums see http://www.jsoftware.com/forums.htm