Re: [Jprogramming] "Segmented Strings"

Raul Miller Tue, 08 Apr 2014 13:52:27 -0700

It's available for free now, with some limitations:

http://kx.com/software-download.php


It'll take me a few years, though, to develop a fluency in K (Q actually,
or kdb+ ...) which approaches my fluency in other languages. Anyways, it's
not at all clear that K (or Q or KDB+) would be any better for this
application than J. The grass is always greener on the other side of the
fence, especially after you've crossed it?

Also, if I do my job properly, the language itself becomes irrelevant and
the data structures are straightforward enough to allow any arbitrary
language to be used.

(Meanwhile, I've got J running on OpenBSD, which pleases me.)

Thanks,

-- 
Raul


On Tue, Apr 8, 2014 at 2:54 PM, km <[email protected]> wrote:

> I think I would pay for k's database capability.  --Kip Murray
>
> > On Apr 8, 2014, at 12:46 PM, Björn Helgason <[email protected]> wrote:
> >
> > I would take a look at the mapped file database lab to get ideas.
> >
> > -
> > Björn Helgason
> > gsm:6985532
> > skype:gosiminn
> >> On 8.4.2014 15:34, "Raul Miller" <[email protected]> wrote:
> >>
> >> I have thought about using symbols, but the only way to delete symbols
> >> that I know of involves exiting J. And, my starting premise was that I
> >> would have too much data to fit into memory.
> >>
> >> For some computations it does make sense to start up an independent J
> >> session for each part of the calculation (and, in fact, that is what I
> >> am doing in a different aspect of dealing with this dataset - it's about
> >> 10 terabytes, or so I am told - I've not actually seen it all yet and it
> >> takes time to upload it). But for some calculations you need to be able
> >> to correlate between pieces which have been dealt with elsewhere.
> >>
> >> I have similar reservations about fixed-width fields. There's just too
> >> much data for me to predict how wide the fields are going to be. In some
> >> cases I might actually be going with fixed-width, but that might be too
> >> inefficient for the general case. I've one field which would have to be
> >> over 100k in width if it was fixed width, even though typical cases are
> >> shorter than 1k. At some point I might go with fixed width, and I expect
> >> that doing so will cause me to lose a few records which will be
> >> discovered later in processing. That might not be a big deal, for this
> >> large of a data set, but if it's not necessary why bother?
> >>
> >> Finally, Bjorn's suggestion of using mapped files does seem like a good
> >> idea, at least for the character data. But that is an optimization and
> >> optimizations speed up some operations at the expense of slowing down
> >> other operations. So what really matters is the workload.
> >>
> >> Ultimately, for a dataset this large, it's going to take time.
> >>
> >> Thanks,
> >>
> >> --
> >> Raul
> >>
> >>
> >>
> >>
> >>> On Tue, Apr 8, 2014 at 6:06 AM, Joe Bogner <[email protected]> wrote:
> >>>
> >>> It seems this representation is somewhat similar to how the symbol
> >>> table stores strings:
> >>>
> >>> http://m.jsoftware.com/help/dictionary/dsco.htm
> >>>
> >>> Also, did you consider using symbols? I've used symbols for string
> >>> columns that contain highly repetitive data, for example, an invoice
> >>> table with an alpha-numeric SKU.
> >>>
> >>> Thanks for sharing
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> On Tue, Apr 8, 2014 at 2:40 AM, Raul Miller <[email protected]> wrote:
> >>>
> >>>> Consider this example:
> >>>>
> >>>> table=:<;._2;._2]0 :0
> >>>> First Name,Last Name,Sum,
> >>>> Adam,Wallace,19,
> >>>> Travis,Smith,10,
> >>>> Donald,Barnell,8,
> >>>> Gary,Wallace,27,
> >>>> James,Smith,10,
> >>>> Sam,Johnson,10,
> >>>> Travis,Neal,11,
> >>>> Adam,Campbell,11,
> >>>> Walter,Abbott,13,
> >>>> )
> >>>>
> >>>> Using boxed strings works great for relatively small sets of data. But
> >>>> when things get big, their overhead starts to hurt too much. (Big
> >>>> means: so much data that you'll probably not be able to fit it all in
> >>>> memory at the same time. So you need to plan on relatively frequent
> >>>> delays while reading from disk.)
> >>>>
> >>>> One alternative to boxed strings is segmented strings. A segmented
> >>>> string is an argument which could be passed to <;._1. It's basically
> >>>> just a string with a prefix delimiter. You can work with these sorts of
> >>>> strings directly, and achieve results similar to what you would achieve
> >>>> with boxed arrays.
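> >>>>
> >>>> As a minimal sketch of that equivalence (the name seg and the comma
> >>>> delimiter are made up for illustration, not part of the data above),
> >>>> cutting a segmented string with ;._1 recovers the boxed fields, and
> >>>> other verbs can be applied to each field without boxing at all:
> >>>>
> >>>>   seg=: ',Adam,Travis,Donald'   NB. the first character is the delimiter
> >>>>   (<;._1 seg) -: 'Adam';'Travis';'Donald'
> >>>> 1
> >>>>   #;._1 seg                     NB. field lengths, no boxes needed
> >>>> 4 6 6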
> >>>>
> >>>> Segmented strings are a bit clumsier than boxed arrays - you lose a lot
> >>>> of the integrity checks, so if you mess up you probably will not see an
> >>>> error. So it's probably a good idea to model your code using boxed
> >>>> arrays on a small set of data and then convert to segmented
> >>>> representation once you're happy with how things work (and once you see
> >>>> a time cost that makes it worth spending the time to rework your code).
> >>>>
> >>>> Also, to avoid having to use f;._2 (or whatever) every time, it's good
> >>>> to do an initial pass on the data, to extract its structure.
> >>>>
> >>>> Here's an example:
> >>>>
> >>>> FirstName=:;LF&,each }.0{"1 table
> >>>>
> >>>> LastName=:;LF&,each }.1{"1 table
> >>>>
> >>>> Sum=:;LF&,each }.2{"1 table
> >>>>
> >>>>
> >>>> ssdir=: [:(}:,:2-~/\])I.@(= {.),#
> >>>>
> >>>> FirstNameDir=: ssdir FirstName
> >>>> LastNameDir=: ssdir LastName
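> >>>>
> >>>> To make the directory concrete, here is a sketch on a tiny
> >>>> comma-delimited string (seg and dir are hypothetical names used only
> >>>> for this illustration). The first row holds the offset of each
> >>>> delimiter and the second row holds each segment's length, counting its
> >>>> delimiter:
> >>>>
> >>>>   ssdir=: [:(}:,:2-~/\])I.@(= {.),#
> >>>>   seg=: ',Adam,Travis,Donald'
> >>>>   dir=: ssdir seg
> >>>>   dir -: 0 5 12 ,: 5 7 7
> >>>> 1
> >>>>   (<: 1{1{dir) {. (>: 1{0{dir) }. seg   NB. field 1, delimiter dropped
> >>>> Travis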
> >>>>
> >>>> Actually, sum is numeric so let's just use a numeric representation for
> >>>> that column
> >>>>
> >>>> Sum=: _&".@> }.2{"1 table
> >>>>
> >>>> Which rows have a last name of Smith?
> >>>>
> >>>>   <:({.LastNameDir) I. I.'Smith' E. LastName
> >>>>
> >>>> 1 4
> >>>>
> >>>>
> >>>> Actually, there's an assumption there that Smith is not part of some
> >>>> larger name. We can include the delimiter in the search if we are
> >>>> concerned about that. For even more protection we could append a
> >>>> trailing delimiter on our segmented string and then search for (in this
> >>>> case) LF,'Smith',LF.
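> >>>>
> >>>> Here is a sketch of that stricter search on a small made-up column
> >>>> (names and dir are hypothetical, chosen so that 'Smith' also occurs
> >>>> inside longer names). Because a delimited match starts exactly at a
> >>>> delimiter, the row can be looked up in the directory with i. rather
> >>>> than I.:
> >>>>
> >>>>   ssdir=: [:(}:,:2-~/\])I.@(= {.),#
> >>>>   names=: ;LF&,each 'Smithson';'Smith';'McSmith';'Smith'
> >>>>   dir=: ssdir names
> >>>>   <:({.dir) I. I.'Smith' E. names              NB. substring search
> >>>> 0 1 2 3
> >>>>   ({.dir) i. I.(LF,'Smith',LF) E. names,LF     NB. whole-field search
> >>>> 1 3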
> >>>>
> >>>>
> >>>> Anyways, let's extract the corresponding sums and first names:
> >>>>
> >>>>
> >>>>   1 4{Sum
> >>>>
> >>>> 10 10
> >>>>
> >>>>
> >>>>   FirstName{~;<@(+ i.)/"1|:1 4 {"1 FirstNameDir
> >>>>
> >>>>
> >>>> Travis
> >>>>
> >>>> James
> >>>>
> >>>>
> >>>> Note that that last expression is a bit complicated. It's not so bad,
> >>>> though, if what you are extracting is a small part of the total. And,
> >>>> in that case, using a list of indices to express a boolean result seems
> >>>> like a good thing. You wind up working with set operations
> >>>> (intersection and union) rather than logical operations (and and or).
> >>>> Also, set difference instead of logical not (dyadic -. instead of
> >>>> monadic -.).
> >>>>
> >>>>
> >>>> intersect=: [ -. -.
> >>>>
> >>>> union=: ~.@,
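> >>>>
> >>>> For instance (a sketch; smith and ten are hypothetical names), with
> >>>> the Sum column from above written out literally, a condition like
> >>>> "last name is Smith and Sum is 10" becomes an intersection of two
> >>>> index lists:
> >>>>
> >>>>   intersect=: [ -. -.
> >>>>   union=: ~.@,
> >>>>   smith=: 1 4                               NB. rows found by the Smith search
> >>>>   ten=: I. 10 = 19 10 8 27 10 10 11 11 13   NB. rows where Sum is 10
> >>>>   smith intersect ten
> >>>> 1 4
> >>>>   smith union ten
> >>>> 1 4 5
> >>>>   ten -. smith                              NB. Sum is 10 but not Smith
> >>>> 5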
> >>>>
> >>>>
> >>>> (It looks like I might be using this kind of thing really soon, so I
> >>>> thought I'd lay down my thoughts here and invite comment.)
> >>>>
> >>>>
> >>>> Thanks,
> >>>>
> >>>>
> >>>> --
> >>>>
> >>>> Raul
> >>>> ----------------------------------------------------------------------
> >>>> For information about J forums see http://www.jsoftware.com/forums.htm
