Re: [Jprogramming] "Segmented Strings"

Raul Miller Wed, 09 Apr 2014 07:06:57 -0700

How?

Thanks,


-- 
Raul



On Wed, Apr 9, 2014 at 3:44 AM, Linda Alvord <[email protected]> wrote:

> I would not get rid of your table made of strings.  I would access it in
> the form of J tables because that is what J does nicely.
>
> Linda
>
> -----Original Message-----
> From: [email protected] [mailto:
> [email protected]] On Behalf Of Raul Miller
> Sent: Wednesday, April 09, 2014 2:48 AM
> To: Programming forum
> Subject: Re: [Jprogramming] "Segmented Strings"
>
> The plan is that segmented strings are the data in the database.
>
> There's just too much information to hold it all in memory on a single
> machine.
>
> Thanks,
>
> --
> Raul
>
>
> On Wed, Apr 9, 2014 at 2:23 AM, Linda Alvord <[email protected]> wrote:
>
> > > I know almost nothing about large databases, but what is the advantage of
> > > staying with segmented strings after the database is built?
> > >
> > > Once you have your table, or maybe two or more tables of character and
> > > numeric data, you might stay in J and make "subtables" which can be
> > > catenated together and destroyed as needed.  You could also do selections
> > > of subsets more easily.
> >
> >   ]FirstName=:;LF&,each }.0{"1 table
> >
> > Adam
> > Travis
> > Donald
> > Gary
> > James
> > Sam
> > Travis
> > Adam
> > Walter
> >
> >    ]FN2=:   >"0 }.0{"1 table
> > Adam
> > Travis
> > Donald
> > Gary
> > James
> > Sam
> > Travis
> > Adam
> > Walter
> >
> >    FN2-:FirstName
> > 0
> >    $FirstName
> > 53
> >    $FN2
> > 9 6
> >
> > Linda
> >
> >
> > -----Original Message-----
> > From: [email protected] [mailto:
> > [email protected]] On Behalf Of Raul Miller
> > Sent: Tuesday, April 08, 2014 8:22 PM
> > To: Programming forum
> > Subject: Re: [Jprogramming] "Segmented Strings"
> >
> > > I might indeed do that, but in some cases the time to read the file
> > > itself will be mostly network transfer time. And, once it's in memory,
> > > how it got there isn't really an issue.
> >
> > Still, it's worth benchmarking.
> >
> > Thanks,
> >
> > --
> > Raul
> >
> >
> > > On Tue, Apr 8, 2014 at 8:18 PM, Vijay Lulla <[email protected]> wrote:
> >
> > > I second memory mapped files and mapped file database.
> > >
> > >
> > > > On Tue, Apr 8, 2014 at 4:51 PM, Raul Miller <[email protected]> wrote:
> > >
> > > > It's available for free now, with some limitations:
> > > >
> > > > http://kx.com/software-download.php
> > > >
> > > > > It'll take me a few years, though, to develop a fluency in K (Q
> > > > > actually, or kdb+ ...) which approaches my fluency in other languages.
> > > > > Anyways, it's not at all clear that K (or Q or kdb+) would be any
> > > > > better for this application than J. The grass is always greener on
> > > > > the other side of the fence, especially after you've crossed it?
> > > > >
> > > > > Also, if I do my job properly, the language itself becomes irrelevant
> > > > > and the data structures are straightforward enough to allow any
> > > > > arbitrary language to be used.
> > > >
> > > > (Meanwhile, I've got J running on OpenBSD, which pleases me.)
> > > >
> > > > --
> > > > Raul
> > > >
> > > > Thanks,
> > > >
> > > > --
> > > > Raul
> > > >
> > > >
> > > > On Tue, Apr 8, 2014 at 2:54 PM, km <[email protected]> wrote:
> > > >
> > > > > I think I would pay for k's database capability.  --Kip Murray
> > > > >
> > > > > Sent from my iPad
> > > > >
> > > > > > > On Apr 8, 2014, at 12:46 PM, Björn Helgason <[email protected]> wrote:
> > > > > >
> > > > > > I would take a look at the mapped file database lab to get ideas.
> > > > > >
> > > > > > -
> > > > > > Björn Helgason
> > > > > > gsm:6985532
> > > > > > skype:gosiminn
> > > > > >> On 8.4.2014 15:34, "Raul Miller" <[email protected]> wrote:
> > > > > >>
> > > > > > >> I have thought about using symbols, but the only way to delete
> > > > > > >> symbols that I know of involves exiting J. And, my starting premise
> > > > > > >> was that I would have too much data to fit into memory.
> > > > > >>
> > > > > > >> For some computations it does make sense to start up an independent
> > > > > > >> J session for each part of the calculation (and, in fact, that is
> > > > > > >> what I am doing in a different aspect of dealing with this dataset -
> > > > > > >> it's about 10 terabytes, or so I am told - I've not actually seen it
> > > > > > >> all yet and it takes time to upload it). But for some calculations
> > > > > > >> you need to be able to correlate between pieces which have been
> > > > > > >> dealt with elsewhere.
> > > > > >>
> > > > > > >> I have similar reservations about fixed-width fields. There's just
> > > > > > >> too much data for me to predict how wide the fields are going to be.
> > > > > > >> In some cases I might actually be going with fixed-width, but that
> > > > > > >> might be too inefficient for the general case. I've one field which
> > > > > > >> would have to be over 100k in width if it was fixed width, even
> > > > > > >> though typical cases are shorter than 1k. At some point I might go
> > > > > > >> with fixed width, and I expect that doing so will cause me to lose a
> > > > > > >> few records which will be discovered later in processing. That might
> > > > > > >> not be a big deal, for a data set this large, but if it's not
> > > > > > >> necessary why bother?
> > > > > >>
> > > > > > >> Finally, Bjorn's suggestion of using mapped files does seem like a
> > > > > > >> good idea, at least for the character data. But that is an
> > > > > > >> optimization, and optimizations speed up some operations at the
> > > > > > >> expense of slowing down other operations. So what really matters is
> > > > > > >> the workload.
> > > > > >>
> > > > > >> Ultimately, for a dataset this large, it's going to take time.
> > > > > >>
> > > > > >> Thanks,
> > > > > >>
> > > > > >> --
> > > > > >> Raul
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > > >>> On Tue, Apr 8, 2014 at 6:06 AM, Joe Bogner <[email protected]> wrote:
> > > > > >>>
> > > > > > >>> It seems this representation is somewhat similar to how the symbol
> > > > > > >>> table stores strings:
> > > > > > >>>
> > > > > > >>> http://m.jsoftware.com/help/dictionary/dsco.htm
> > > > > > >>>
> > > > > > >>> Also, did you consider using symbols? I've used symbols for string
> > > > > > >>> columns that contain highly repetitive data, for example, an invoice
> > > > > > >>> table with an alpha-numeric SKU.
> > > > > >>>
> > > > > >>> Thanks for sharing
> > > > > >>>
> > > > > >>>
> > > > > >>>
> > > > > >>>
> > > > > >>>
> > > > > >>>
> > > > > > >>> On Tue, Apr 8, 2014 at 2:40 AM, Raul Miller <[email protected]> wrote:
> > > > > >>>
> > > > > >>>> Consider this example:
> > > > > >>>>
> > > > > >>>> table=:<;._2;._2]0 :0
> > > > > >>>> First Name,Last Name,Sum,
> > > > > >>>> Adam,Wallace,19,
> > > > > >>>> Travis,Smith,10,
> > > > > >>>> Donald,Barnell,8,
> > > > > >>>> Gary,Wallace,27,
> > > > > >>>> James,Smith,10,
> > > > > >>>> Sam,Johnson,10,
> > > > > >>>> Travis,Neal,11,
> > > > > >>>> Adam,Campbell,11,
> > > > > >>>> Walter,Abbott,13,
> > > > > >>>> )
> > > > > >>>>
> > > > > > >>>> Using boxed strings works great for relatively small sets of data.
> > > > > > >>>> But when things get big, their overhead starts to hurt too much.
> > > > > > >>>> (Big means: so much data that you'll probably not be able to fit it
> > > > > > >>>> all in memory at the same time. So you need to plan on relatively
> > > > > > >>>> frequent delays while reading from disk.)
> > > > > >>>>
> > > > > > >>>> One alternative to boxed strings is segmented strings. A segmented
> > > > > > >>>> string is an argument which could be passed to <;._1. It's basically
> > > > > > >>>> just a string with a prefix delimiter. You can work with these sorts
> > > > > > >>>> of strings directly, and achieve results similar to what you would
> > > > > > >>>> achieve with boxed arrays.
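> > > > > > >>>>
> > > > > > >>>> (A minimal sketch of that definition, assuming a made-up string ss
> > > > > > >>>> with ',' as its prefix delimiter. The name ss and its contents are
> > > > > > >>>> illustrative only. Cutting with <;._1 recovers the boxed form, and
> > > > > > >>>> prefixing each box with the delimiter and razing rebuilds the
> > > > > > >>>> segmented string.)
> > > > > > >>>>
> > > > > > >>>> ```j
> > > > > > >>>>    NB. ss is a hypothetical segmented string; ',' is its delimiter
> > > > > > >>>>    ss=: ',Adam,Travis,Donald'
> > > > > > >>>>    NB. number of segments after cutting on the first character
> > > > > > >>>>    # <;._1 ss
> > > > > > >>>> 3
> > > > > > >>>>    NB. round trip: box, re-prefix each box, raze
> > > > > > >>>>    ss -: ; ','&,each <;._1 ss
> > > > > > >>>> 1
> > > > > > >>>> ```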
> > > > > >>>>
> > > > > > >>>> Segmented strings are a bit clumsier than boxed arrays - you lose a
> > > > > > >>>> lot of the integrity checks, so if you mess up you probably will not
> > > > > > >>>> see an error. So it's probably a good idea to model your code using
> > > > > > >>>> boxed arrays on a small set of data and then convert to segmented
> > > > > > >>>> representation once you're happy with how things work (and once you
> > > > > > >>>> see a time cost that makes it worth spending the time to rework your
> > > > > > >>>> code).
> > > > > >>>>
> > > > > > >>>> Also, to avoid having to use f;._2 (or whatever) every time, it's
> > > > > > >>>> good to do an initial pass on the data, to extract its structure.
> > > > > >>>>
> > > > > >>>> Here's an example:
> > > > > >>>>
> > > > > >>>> FirstName=:;LF&,each }.0{"1 table
> > > > > >>>>
> > > > > >>>> LastName=:;LF&,each }.1{"1 table
> > > > > >>>>
> > > > > >>>> Sum=:;LF&,each }.2{"1 table
> > > > > >>>>
> > > > > >>>>
> > > > > >>>> ssdir=: [:(}:,:2-~/\])I.@(= {.),#
> > > > > >>>>
> > > > > >>>> FirstNameDir=: ssdir FirstName
> > > > > >>>> LastNameDir=: ssdir LastName
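> > > > > > >>>>
> > > > > > >>>> (To make the layout of such a directory concrete, here is a small
> > > > > > >>>> sketch on a made-up two-segment string with ',' as the delimiter.
> > > > > > >>>> Row 0 holds the offset of each segment's delimiter and row 1 holds
> > > > > > >>>> each segment's length, delimiter included. The sample string is an
> > > > > > >>>> assumption for illustration.)
> > > > > > >>>>
> > > > > > >>>> ```j
> > > > > > >>>>    ssdir=: [:(}:,:2-~/\])I.@(= {.),#
> > > > > > >>>>    NB. delimiters sit at offsets 0 and 3; the string has 7 characters
> > > > > > >>>>    ssdir ',ab,cde'
> > > > > > >>>> 0 3
> > > > > > >>>> 3 4
> > > > > > >>>>    NB. offset 3, length 4 selects the second segment with its delimiter
> > > > > > >>>>    ',ab,cde' {~ 3 + i. 4
> > > > > > >>>> ,cde
> > > > > > >>>> ```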
> > > > > >>>>
> > > > > > >>>> Actually, Sum is numeric, so let's just use a numeric representation
> > > > > > >>>> for that column:
> > > > > >>>>
> > > > > >>>> Sum=: _&".@> }.2{"1 table
> > > > > >>>>
> > > > > >>>> Which rows have a last name of Smith?
> > > > > >>>>
> > > > > >>>>   <:({.LastNameDir) I. I.'Smith' E. LastName
> > > > > >>>>
> > > > > >>>> 1 4
> > > > > >>>>
> > > > > >>>>
> > > > > > >>>> Actually, there's an assumption there that Smith is not part of some
> > > > > > >>>> larger name. We can include the delimiter in the search if we are
> > > > > > >>>> concerned about that. For even more protection we could append a
> > > > > > >>>> trailing delimiter on our segmented string and then search for (in
> > > > > > >>>> this case) LF,'Smith',LF.
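> > > > > > >>>>
> > > > > > >>>> (A sketch of that delimited search, assuming a small made-up column
> > > > > > >>>> ln that contains both Smith and Smithson. The names ln and lnDir are
> > > > > > >>>> hypothetical. Because the delimited match starts on the delimiter
> > > > > > >>>> itself, its offsets line up with the directory and no decrement is
> > > > > > >>>> needed.)
> > > > > > >>>>
> > > > > > >>>> ```j
> > > > > > >>>>    ssdir=: [:(}:,:2-~/\])I.@(= {.),#
> > > > > > >>>>    ln=: LF,'Smith',LF,'Smithson',LF,'Neal',LF,'Smith'
> > > > > > >>>>    lnDir=: ssdir ln
> > > > > > >>>>    NB. plain search: row 1 (Smithson) is a false positive
> > > > > > >>>>    <:({.lnDir) I. I.'Smith' E. ln
> > > > > > >>>> 0 1 3
> > > > > > >>>>    NB. delimited search against the string with a trailing delimiter
> > > > > > >>>>    ({.lnDir) I. I.(LF,'Smith',LF) E. ln,LF
> > > > > > >>>> 0 3
> > > > > > >>>> ```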
> > > > > >>>>
> > > > > >>>>
> > > > > > >>>> Anyways, let's extract the corresponding sums and first names:
> > > > > >>>>
> > > > > >>>>
> > > > > >>>>   1 4{Sum
> > > > > >>>>
> > > > > >>>> 10 10
> > > > > >>>>
> > > > > >>>>
> > > > > >>>>   FirstName{~;<@(+ i.)/"1|:1 4 {"1 FirstNameDir
> > > > > >>>>
> > > > > >>>>
> > > > > >>>> Travis
> > > > > >>>>
> > > > > >>>> James
> > > > > >>>>
> > > > > >>>>
> > > > > > >>>> Note that that last expression is a bit complicated. It's not so
> > > > > > >>>> bad, though, if what you are extracting is a small part of the
> > > > > > >>>> total. And, in that case, using a list of indices to express a
> > > > > > >>>> boolean result seems like a good thing. You wind up working with set
> > > > > > >>>> operations (intersection and union) rather than logical operations
> > > > > > >>>> (and and or). Also, set difference instead of logical not (dyadic -.
> > > > > > >>>> instead of monadic -.).
> > > > > >>>>
> > > > > >>>>
> > > > > >>>> intersect=: [ -. -.
> > > > > >>>>
> > > > > > >>>> union=: ~.@,
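> > > > > > >>>>
> > > > > > >>>> (A short usage sketch, assuming two index lists from the sample
> > > > > > >>>> table: smith holds the rows with last name Smith, and ten holds the
> > > > > > >>>> rows whose Sum is 10, which is what I. Sum = 10 would give. Both
> > > > > > >>>> names are hypothetical.)
> > > > > > >>>>
> > > > > > >>>> ```j
> > > > > > >>>>    intersect=: [ -. -.
> > > > > > >>>>    union=: ~.@,
> > > > > > >>>>    smith=: 1 4
> > > > > > >>>>    ten=: 1 4 5
> > > > > > >>>>    NB. rows that are Smith AND have a sum of 10
> > > > > > >>>>    smith intersect ten
> > > > > > >>>> 1 4
> > > > > > >>>>    NB. rows that are Smith OR have a sum of 10
> > > > > > >>>>    smith union ten
> > > > > > >>>> 1 4 5
> > > > > > >>>>    NB. rows with a sum of 10 that are NOT Smith
> > > > > > >>>>    ten -. smith
> > > > > > >>>> 5
> > > > > > >>>> ```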
> > > > > >>>>
> > > > > >>>>
> > > > > > >>>> (It looks like I might be using this kind of thing really soon, so I
> > > > > > >>>> thought I'd lay down my thoughts here and invite comment.)
> > > > > >>>>
> > > > > >>>>
> > > > > >>>> Thanks,
> > > > > >>>>
> > > > > >>>>
> > > > > >>>> --
> > > > > >>>>
> > > > > >>>> Raul
> > > > > >>>>
> > > > > > >>>> ----------------------------------------------------------------------
> > > > > > >>>> For information about J forums see http://www.jsoftware.com/forums.htm
> > > > > >>>>
> > > > > >>>
> > > >
> ----------------------------------------------------------------------
> > > > > >>> For information about J forums see
> > > > http://www.jsoftware.com/forums.htm
> > > > > >>>
> > > > > >>
> > > ----------------------------------------------------------------------
> > > > > >> For information about J forums see
> > > > http://www.jsoftware.com/forums.htm
> > > > > >>
> > > > > >
> > > ----------------------------------------------------------------------
> > > > > > For information about J forums see
> > > http://www.jsoftware.com/forums.htm
> > > > >
> > ----------------------------------------------------------------------
> > > > > For information about J forums see
> > http://www.jsoftware.com/forums.htm
> > > > >
> > > >
> ----------------------------------------------------------------------
> > > > For information about J forums see
> http://www.jsoftware.com/forums.htm
> > > >
> > > ----------------------------------------------------------------------
> > > For information about J forums see http://www.jsoftware.com/forums.htm
> > >
> > ----------------------------------------------------------------------
> > For information about J forums see http://www.jsoftware.com/forums.htm
> >
> > ----------------------------------------------------------------------
> > For information about J forums see http://www.jsoftware.com/forums.htm
> >
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm
>
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm
>
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm