Re: [Jprogramming] "Segmented Strings"

Raul Miller Tue, 08 Apr 2014 17:22:27 -0700

I might indeed do that, but in some cases the time to read the file itself
will be mostly network transfer time. And, once it's in memory, how it got
there isn't really an issue.


Still, it's worth benchmarking.

Thanks,

-- 
Raul


On Tue, Apr 8, 2014 at 8:18 PM, Vijay Lulla <[email protected]> wrote:

> I second memory mapped files and mapped file database.
>
>
> On Tue, Apr 8, 2014 at 4:51 PM, Raul Miller <[email protected]> wrote:
>
> > It's available for free now, with some limitations:
> >
> > http://kx.com/software-download.php
> >
> > It'll take me a few years, though, to develop a fluency in K (Q actually,
> > or kdb+ ...) which approaches my fluency in other languages. Anyways, it's
> > not at all clear that K (or Q or KDB+) would be any better for this
> > application than J. The grass is always greener on the other side of the
> > fence, especially after you've crossed it?
> >
> > Also, if I do my job properly, the language itself becomes irrelevant and
> > the data structures are straightforward enough to allow any arbitrary
> > language to be used.
> >
> > (Meanwhile, I've got J running on OpenBSD, which pleases me.)
> >
> > Thanks,
> >
> > --
> > Raul
> >
> >
> > On Tue, Apr 8, 2014 at 2:54 PM, km <[email protected]> wrote:
> >
> > > I think I would pay for k's database capability.  --Kip Murray
> > >
> > > Sent from my iPad
> > >
> > > > On Apr 8, 2014, at 12:46 PM, Björn Helgason <[email protected]> wrote:
> > > >
> > > > I would take a look at the mapped file database lab to get ideas.
> > > >
> > > > -
> > > > Björn Helgason
> > > > gsm:6985532
> > > > skype:gosiminn
> > > >> On 8.4.2014 15:34, "Raul Miller" <[email protected]> wrote:
> > > >>
> > > > >> I have thought about using symbols, but the only way to delete symbols
> > > > >> that I know of involves exiting J. And, my starting premise was that I
> > > > >> would have too much data to fit into memory.
> > > >>
> > > > >> For some computations it does make sense to start up an independent J
> > > > >> session for each part of the calculation (and, in fact, that is what I
> > > > >> am doing in a different aspect of dealing with this dataset - it's
> > > > >> about 10 terabytes, or so I am told - I've not actually seen it all
> > > > >> yet and it takes time to upload it). But for some calculations you
> > > > >> need to be able to correlate between pieces which have been dealt with
> > > > >> elsewhere.
> > > >>
> > > > >> I have similar reservations about fixed-width fields. There's just too
> > > > >> much data for me to predict how wide the fields are going to be. In
> > > > >> some cases I might actually be going with fixed-width, but that might
> > > > >> be too inefficient for the general case. I've one field which would
> > > > >> have to be over 100k in width if it was fixed width, even though
> > > > >> typical cases are shorter than 1k. At some point I might go with fixed
> > > > >> width, and I expect that doing so will cause me to lose a few records
> > > > >> which will be discovered later in processing. That might not be a big
> > > > >> deal, for this large of a data set, but if it's not necessary why
> > > > >> bother?
> > > >>
> > > > >> Finally, Bjorn's suggestion of using mapped files does seem like a
> > > > >> good idea, at least for the character data. But that is an
> > > > >> optimization and optimizations speed up some operations at the expense
> > > > >> of slowing down other operations. So what really matters is the
> > > > >> workload.
> > > >>
> > > >> Ultimately, for a dataset this large, it's going to take time.
> > > >>
> > > >> Thanks,
> > > >>
> > > >> --
> > > >> Raul
> > > >>
> > > >>
> > > >>
> > > >>
> > > > >>> On Tue, Apr 8, 2014 at 6:06 AM, Joe Bogner <[email protected]> wrote:
> > > >>>
> > > > >>> It seems this representation is somewhat similar to how the symbol
> > > > >>> table stores strings:
> > > > >>>
> > > > >>> http://m.jsoftware.com/help/dictionary/dsco.htm
> > > > >>>
> > > > >>> Also, did you consider using symbols? I've used symbols for string
> > > > >>> columns that contain highly repetitive data, for example, an invoice
> > > > >>> table with an alpha-numeric SKU.
> > > >>>
> > > >>> Thanks for sharing
> > > >>>
> > > >>>
> > > >>>
> > > >>>
> > > >>>
> > > >>>
> > > > >>> On Tue, Apr 8, 2014 at 2:40 AM, Raul Miller <[email protected]> wrote:
> > > >>>
> > > >>>> Consider this example:
> > > >>>>
> > > >>>> table=:<;._2;._2]0 :0
> > > >>>> First Name,Last Name,Sum,
> > > >>>> Adam,Wallace,19,
> > > >>>> Travis,Smith,10,
> > > >>>> Donald,Barnell,8,
> > > >>>> Gary,Wallace,27,
> > > >>>> James,Smith,10,
> > > >>>> Sam,Johnson,10,
> > > >>>> Travis,Neal,11,
> > > >>>> Adam,Campbell,11,
> > > >>>> Walter,Abbott,13,
> > > >>>> )
> > > >>>>
> > > > >>>> Using boxed strings works great for relatively small sets of data.
> > > > >>>> But when things get big, their overhead starts to hurt too much.
> > > > >>>> (Big means: so much data that you'll probably not be able to fit it
> > > > >>>> all in memory at the same time. So you need to plan on relatively
> > > > >>>> frequent delays while reading from disk.)
> > > >>>>
> > > > >>>> One alternative to boxed strings is segmented strings. A segmented
> > > > >>>> string is an argument which could be passed to <;._1. It's basically
> > > > >>>> just a string with a prefix delimiter. You can work with these sorts
> > > > >>>> of strings directly, and achieve results similar to what you would
> > > > >>>> achieve with boxed arrays.
> > > >>>>
> > > > >>>> Segmented strings are a bit clumsier than boxed arrays - you lose a
> > > > >>>> lot of the integrity checks, so if you mess up you probably will not
> > > > >>>> see an error. So it's probably a good idea to model your code using
> > > > >>>> boxed arrays on a small set of data and then convert to segmented
> > > > >>>> representation once you're happy with how things work (and once you
> > > > >>>> see a time cost that makes it worth spending the time to rework your
> > > > >>>> code).
> > > >>>>
> > > > >>>> Also, to avoid having to use f;._2 (or whatever) every time, it's
> > > > >>>> good to do an initial pass on the data, to extract its structure.
> > > >>>>
> > > >>>> Here's an example:
> > > >>>>
> > > >>>> FirstName=:;LF&,each }.0{"1 table
> > > >>>>
> > > >>>> LastName=:;LF&,each }.1{"1 table
> > > >>>>
> > > >>>> Sum=:;LF&,each }.2{"1 table
> > > >>>>
> > > >>>>
> > > >>>> ssdir=: [:(}:,:2-~/\])I.@(= {.),#
> > > >>>>
> > > >>>> FirstNameDir=: ssdir FirstName
> > > >>>> LastNameDir=: ssdir LastName
> > > >>>>
> > > > >>>> Actually, sum is numeric so let's just use a numeric representation
> > > > >>>> for that column
> > > >>>>
> > > >>>> Sum=: _&".@> }.2{"1 table
> > > >>>>
> > > >>>> Which rows have a last name of Smith?
> > > >>>>
> > > >>>>   <:({.LastNameDir) I. I.'Smith' E. LastName
> > > >>>>
> > > >>>> 1 4
> > > >>>>
> > > >>>>
> > > > >>>> Actually, there's an assumption there that Smith is not part of some
> > > > >>>> larger name. We can include the delimiter in the search if we are
> > > > >>>> concerned about that. For even more protection we could append a
> > > > >>>> trailing delimiter on our segmented string and then search for (in
> > > > >>>> this case) LF,'Smith',LF.
> > > >>>>
> > > >>>>
> > > >>>> Anyways, let's extract the corresponding sums and first name:
> > > >>>>
> > > >>>>
> > > >>>>   1 4{Sum
> > > >>>>
> > > >>>> 10 10
> > > >>>>
> > > >>>>
> > > >>>>   FirstName{~;<@(+ i.)/"1|:1 4 {"1 FirstNameDir
> > > >>>>
> > > >>>>
> > > >>>> Travis
> > > >>>>
> > > >>>> James
> > > >>>>
> > > >>>>
> > > > >>>> Note that that last expression is a bit complicated. It's not so bad,
> > > > >>>> though, if what you are extracting is a small part of the total. And,
> > > > >>>> in that case, using a list of indices to express a boolean result
> > > > >>>> seems like a good thing. You wind up working with set operations
> > > > >>>> (intersection and union) rather than logical operations (and and or).
> > > > >>>> Also, set difference instead of logical not (dyadic -. instead of
> > > > >>>> monadic -.).
> > > >>>>
> > > >>>>
> > > >>>> intersect=: [ -. -.
> > > >>>>
> > > > >>>> union=: ~.@,
> > > >>>>
> > > >>>>
> > > > >>>> (It looks like I might be using this kind of thing really soon, so I
> > > > >>>> thought I'd lay down my thoughts here and invite comment.)
> > > >>>>
> > > >>>>
> > > >>>> Thanks,
> > > >>>>
> > > >>>>
> > > >>>> --
> > > >>>>
> > > >>>> Raul
> > > >>>>
> > > > >>>> ----------------------------------------------------------------------
> > > > >>>> For information about J forums see http://www.jsoftware.com/forums.htm
> > > >>>>
> > > >>>
> > ----------------------------------------------------------------------
> > > >>> For information about J forums see
> > http://www.jsoftware.com/forums.htm
> > > >>>
> > > >>
> ----------------------------------------------------------------------
> > > >> For information about J forums see
> > http://www.jsoftware.com/forums.htm
> > > >>
> > > >
> ----------------------------------------------------------------------
> > > > For information about J forums see
> http://www.jsoftware.com/forums.htm
> > > ----------------------------------------------------------------------
> > > For information about J forums see http://www.jsoftware.com/forums.htm
> > >
> > ----------------------------------------------------------------------
> > For information about J forums see http://www.jsoftware.com/forums.htm
> >
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm
>
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
