It seems this representation is somewhat similar to how the symbol table
stores strings:

http://m.jsoftware.com/help/dictionary/dsco.htm

Also, did you consider using symbols? I've used symbols for string columns
that contain highly repetitive data, for example, an invoice table with an
alpha-numeric SKU.
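
A minimal sketch of what I mean, assuming the table from your example below
(the names LastNameSym and isSmith are mine, just for illustration):

```j
LastNameSym=: s: }.1{"1 table          NB. boxed last names -> symbols
isSmith=: LastNameSym = s: <'Smith'    NB. compares symbols, not strings
I. isSmith                             NB. rows with last name Smith
5 s: ~. LastNameSym                    NB. distinct names, back as boxes
```

Each distinct name is stored once in the symbol table, so a repetitive
column costs one symbol per row plus one copy of each distinct string. If I
read your example right, I. isSmith should agree with your 1 4.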

Thanks for sharing.

On Tue, Apr 8, 2014 at 2:40 AM, Raul Miller wrote:

> Consider this example:
>
> table=:<;._2;._2]0 :0
> First Name,Last Name,Sum,
> Adam,Wallace,19,
> Travis,Smith,10,
> Donald,Barnell,8,
> Gary,Wallace,27,
> James,Smith,10,
> Sam,Johnson,10,
> Travis,Neal,11,
> Adam,Campbell,11,
> Walter,Abbott,13,
> )
>
> Using boxed strings works great for relatively small sets of data. But when
> things get big, their overhead starts to hurt too much.  (Big means: so much
> data that you'll probably not be able to fit it all in memory at the same
> time. So you need to plan on relatively frequent delays while reading from
> disk.)
>
> One alternative to boxed strings is segmented strings. A segmented string
> is an argument which could be passed to <;._1. It's basically just a string
> with a prefix delimiter. You can work with these sorts of strings directly,
> and achieve results similar to what you would achieve with boxed arrays.
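>
> A minimal sketch of the idea, using an LF-delimited string made up purely
> for illustration (the name seg is hypothetical):
>
> ```j
> seg=: LF,'Adam',LF,'Travis',LF,'Donald'   NB. every item has a prefix LF
> <;._1 seg                                 NB. boxes: Adam, Travis, Donald
> +/ LF = seg                               NB. item count without boxing: 3
> #;._1 seg                                 NB. item lengths: 4 6 6
> ```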
>
> Segmented strings are a bit clumsier than boxed arrays - you lose a lot of
> the integrity checks, so if you mess up you probably will not see an error.
> So it's probably a good idea to model your code using boxed arrays on a
> small set of data and then convert to segmented representation once you're
> happy with how things work (and once you see a time cost that makes it
> worth spending the time to rework your code).
>
> Also, to avoid having to use f;._2 (or whatever) every time, it's good to
> do an initial pass on the data, to extract its structure.
>
> Here's an example:
>
> FirstName=:;LF&,each }.0{"1 table
>
> LastName=:;LF&,each }.1{"1 table
>
> Sum=:;LF&,each }.2{"1 table
>
>
> ssdir=: [:(}:,:2-~/\])I.@(= {.),#
>
> FirstNameDir=: ssdir FirstName
> LastNameDir=: ssdir LastName
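>
> For reference, a sketch of what ssdir should produce on a small made-up
> string (seg is a hypothetical name, not part of the data above). Row 0
> holds the offset of each delimiter and row 1 the length of each segment,
> delimiter included:
>
> ```j
> seg=: LF,'Adam',LF,'Travis',LF,'Donald'
> ssdir seg
> NB. expected display:
> NB. 0 5 12
> NB. 5 7  7
> ```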
>
> Actually, Sum is numeric, so let's just use a numeric representation for
> that column:
>
> Sum=: _&".@> }.2{"1 table
>
> Which rows have a last name of Smith?
>
>    <:({.LastNameDir) I. I.'Smith' E. LastName
>
> 1 4
>
>
> Actually, there's an assumption there that Smith is not part of some larger
> name. We can include the delimiter in the search if we are concerned about
> that. For even more protection we could append a trailing delimiter on our
> segmented string and then search for (in this case) LF,'Smith',LF.
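>
> A sketch of that stricter search, reusing LastName and LastNameDir from
> above (the name hits is hypothetical). Appending a trailing LF lets the
> pattern require a delimiter on both sides. Each hit is then the offset of
> a leading delimiter, so a direct lookup in the directory can stand in for
> the interval index:
>
> ```j
> hits=: I. (LF,'Smith',LF) E. LastName,LF
> ({.LastNameDir) i. hits     NB. should again give 1 4
> ```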
>
>
> Anyways, let's extract the corresponding sums and first names:
>
>
>    1 4{Sum
>
> 10 10
>
>
>    FirstName{~;<@(+ i.)/"1|:1 4 {"1 FirstNameDir
>
>
> Travis
>
> James
>
>
> Note that that last expression is a bit complicated. It's not so bad,
> though, if what you are extracting is a small part of the total. And, in
> that case, using a list of indices to express a boolean result seems like a
> good thing. You wind up working with set operations (intersection and
> union) rather than logical operations (and and or). Also, set difference
> instead of logical not (dyadic -. instead of monadic -.).
>
>
> intersect=: [ -. -.
>
> union=: ~.@,
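>
> A small sketch of how these combine, with literal index lists standing in
> for two query results (smith and ten are hypothetical names):
>
> ```j
> smith=: 1 4            NB. rows whose last name is Smith
> ten=: 1 4 5            NB. rows whose Sum is 10
> smith intersect ten    NB. "and": 1 4
> smith union ten        NB. "or": 1 4 5
> ten -. smith           NB. "and not": 5
> ```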
>
>
> (It looks like I might be using this kind of thing really soon, so I
> thought I'd lay down my thoughts here and invite comment.)
>
>
> Thanks,
>
>
> --
>
> Raul
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm
>