Re: [Jprogramming] "Segmented Strings"

Björn Helgason Tue, 08 Apr 2014 10:47:37 -0700

I would take a look at the mapped file database lab to get ideas.

-
Björn Helgason
gsm:6985532
skype:gosiminn
On 8.4.2014 15:34, "Raul Miller" <[email protected]> wrote:


> I have thought about using symbols, but the only way to delete symbols that
> I know of involves exiting J. And, my starting premise was that I would
> have too much data to fit into memory.
>
> For some computations it does make sense to start up an independent J
> session for each part of the calculation (and, in fact, that is what I am
> doing in a different aspect of dealing with this dataset - it's about 10
> terabytes, or so I am told - I've not actually seen it all yet and it takes
> time to upload it). But for some calculations you need to be able to
> correlate between pieces which have been dealt with elsewhere.
>
> I have similar reservations about fixed-width fields. There's just too much
> data for me to predict how wide the fields are going to be. In some cases I
> might actually be going with fixed-width, but that might be too inefficient
> for the general case. I've one field which would have to be over 100k in
> width if it was fixed width, even though typical cases are shorter than 1k.
> At some point I might go with fixed width, and I expect that doing so will
> cause me to lose a few records which will be discovered later in
> processing. That might not be a big deal, for this large of a data set, but
> if it's not necessary why bother?
>
> Finally, Bjorn's suggestion of using mapped files does seem like a good
> idea, at least for the character data. But that is an optimization and
> optimizations speed up some operations at the expense of slowing down other
> operations. So what really matters is the workload.
>
> Ultimately, for a dataset this large, it's going to take time.
>
> Thanks,
>
> --
> Raul
>
>
>
>
> On Tue, Apr 8, 2014 at 6:06 AM, Joe Bogner <[email protected]> wrote:
>
> > It seems this representation is somewhat similar to how the symbol table
> > stores strings:
> >
> > http://m.jsoftware.com/help/dictionary/dsco.htm
> >
> > Also, did you consider using symbols? I've used symbols for string columns
> > that contain highly repetitive data, for example, an invoice table with an
> > alpha-numeric SKU.
> >
> > Thanks for sharing
> >
> >
> >
> >
> >
> >
> > On Tue, Apr 8, 2014 at 2:40 AM, Raul Miller <[email protected]> wrote:
> >
> > > Consider this example:
> > >
> > > table=:<;._2;._2]0 :0
> > > First Name,Last Name,Sum,
> > > Adam,Wallace,19,
> > > Travis,Smith,10,
> > > Donald,Barnell,8,
> > > Gary,Wallace,27,
> > > James,Smith,10,
> > > Sam,Johnson,10,
> > > Travis,Neal,11,
> > > Adam,Campbell,11,
> > > Walter,Abbott,13,
> > > )
> > >
> > > Using boxed strings works great for relatively small sets of data. But when
> > > things get big, their overhead starts to hurt too much. (Big means: so much
> > > data that you'll probably not be able to fit it all in memory at the same
> > > time. So you need to plan on relatively frequent delays while reading from
> > > disk.)
> > >
> > > One alternative to boxed strings is segmented strings. A segmented string
> > > is an argument which could be passed to <;._1. It's basically just a string
> > > with a prefix delimiter. You can work with these sorts of strings directly,
> > > and achieve results similar to what you would achieve with boxed arrays.
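
As a minimal sketch of the definition above (the names seg, boxed and seg2 are made up for illustration, and LF and each are assumed to come from the standard J profile), a segmented string can be boxed with <;._1 and rebuilt by prefixing the delimiter to each box:

```j
NB. A segmented string: every item is preceded by the delimiter (LF here).
seg =: LF,'Adam',LF,'Travis',LF,'Donald'

NB. <;._1 takes the first character as the delimiter and boxes each segment.
boxed =: <;._1 seg

NB. Going back: prepend the delimiter to each box, then raze.
seg2 =: ; LF&, each boxed
```

Under these assumptions seg2 matches seg, so the two representations carry the same information.
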
> > >
> > > Segmented strings are a bit clumsier than boxed arrays - you lose a lot of
> > > the integrity checks, so if you mess up you probably will not see an error.
> > > So it's probably a good idea to model your code using boxed arrays on a
> > > small set of data and then convert to segmented representation once you're
> > > happy with how things work (and once you see a time cost that makes it
> > > worth spending the time to rework your code).
> > >
> > > Also, to avoid having to use f;._2 (or whatever) every time, it's good to
> > > do an initial pass on the data, to extract its structure.
> > >
> > > Here's an example:
> > >
> > > FirstName=:;LF&,each }.0{"1 table
> > >
> > > LastName=:;LF&,each }.1{"1 table
> > >
> > > Sum=:;LF&,each }.2{"1 table
> > >
> > >
> > > ssdir=: [:(}:,:2-~/\])I.@(= {.),#
> > >
> > > FirstNameDir=: ssdir FirstName
> > > LastNameDir=: ssdir LastName
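
To make the directory layout concrete, here is a small sketch on a made-up three-item string (seg, dir, se and second are hypothetical names; ssdir is copied from the message). Row 0 holds the offset of each delimiter and row 1 holds each segment's length, delimiter included:

```j
NB. ssdir as defined in the message: row 0 = delimiter offsets, row 1 = lengths.
ssdir =: [:(}:,:2-~/\])I.@(= {.),#

NB. seg is an illustrative segmented string of 19 characters.
seg =: LF,'Adam',LF,'Travis',LF,'Donald'
dir =: ssdir seg                          NB. 2 3 $ 0 5 12 5 7 7

NB. Item 1: its offset and length, delimiter included.
se =: 1 {"1 dir                           NB. 5 7
NB. Select those characters, then drop the leading delimiter.
second =: }. seg {~ ({. se) + i. {: se    NB. 'Travis'
```

Because the lengths count the delimiter, selecting by offset and length returns the item with its leading LF; the }. strips it.
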
> > >
> > > Actually, sum is numeric so let's just use a numeric representation for
> > > that column:
> > >
> > > Sum=: _&".@> }.2{"1 table
> > >
> > > Which rows have a last name of Smith?
> > >
> > >    <:({.LastNameDir) I. I.'Smith' E. LastName
> > >
> > > 1 4
> > >
> > >
> > > Actually, there's an assumption there that Smith is not part of some larger
> > > name. We can include the delimiter in the search if we are concerned about
> > > that. For even more protection we could append a trailing delimiter on our
> > > segmented string and then search for (in this case) LF,'Smith',LF.
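
A possible sketch of that stricter search, on made-up data where 'Smithson' would fool the plain search (names, dir, naive and strict are hypothetical; ssdir is copied from the message). Since the delimited pattern starts on a delimiter, its hits are exactly directory offsets, so i. can look them up directly:

```j
ssdir =: [:(}:,:2-~/\])I.@(= {.),#

NB. Illustrative column: rows 0 and 3 are Smith, row 1 merely starts with Smith.
names =: LF,'Smith',LF,'Smithson',LF,'Jones',LF,'Smith'
dir =: ssdir names

NB. Plain search, as in the message: also reports row 1.
naive =: <: ({. dir) I. I. 'Smith' E. names                 NB. 0 1 3

NB. Delimited search against the string with a trailing delimiter appended.
strict =: ({. dir) i. I. (LF,'Smith',LF) E. names,LF        NB. 0 3
```

The trailing LF is only appended for the search; the stored string and its directory are unchanged.
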
> > >
> > >
> > > Anyways, let's extract the corresponding sums and first name:
> > >
> > >
> > >    1 4{Sum
> > >
> > > 10 10
> > >
> > >
> > >    FirstName{~;<@(+ i.)/"1|:1 4 {"1 FirstNameDir
> > >
> > >
> > > Travis
> > >
> > > James
> > >
> > >
> > > Note that that last expression is a bit complicated. It's not so bad,
> > > though, if what you are extracting is a small part of the total. And, in
> > > that case, using a list of indices to express a boolean result seems like a
> > > good thing. You wind up working with set operations (intersection and
> > > union) rather than logical operations (and and or). Also, set difference
> > > instead of logical not (dyadic -. instead of monadic -.).
> > >
> > >
> > > intersect=: [ -. -.
> > >
> > > union=: ~.@,
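
A short sketch of how such index lists might be combined, using the sums and row numbers from the example table (smith, travis, sum10 and the result names are illustrative; the two verbs are repeated so the block stands alone):

```j
intersect =: [ -. -.
union =: ~.@,

NB. Row indices taken from the example table in the message.
smith =: 1 4                                   NB. last name Smith
travis =: 1 6                                  NB. first name Travis
sum10 =: I. 10 = 19 10 8 27 10 10 11 11 13     NB. rows where Sum is 10: 1 4 5

both =: smith intersect travis                 NB. "and": ,1
either =: smith union travis                   NB. "or": 1 4 6
notSmith =: sum10 -. smith                     NB. Sum is 10 but not Smith: ,5
```

The results stay as index lists, so they can be fed straight back into { or into a directory lookup.
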
> > >
> > >
> > > (It looks like I might be using this kind of thing really soon, so I
> > > thought I'd lay down my thoughts here and invite comment.)
> > >
> > >
> > > Thanks,
> > >
> > >
> > > --
> > >
> > > Raul
> > > ----------------------------------------------------------------------
> > > For information about J forums see http://www.jsoftware.com/forums.htm
> > >
