It seems this representation is somewhat similar to how the symbol table stores strings:
http://m.jsoftware.com/help/dictionary/dsco.htm

Also, did you consider using symbols? I've used symbols for string
columns that contain highly repetitive data, for example, an invoice
table with an alphanumeric SKU.

Thanks for sharing.

On Tue, Apr 8, 2014 at 2:40 AM, Raul Miller <[email protected]> wrote:
> Consider this example:
>
>    table=:<;._2;._2]0 :0
> First Name,Last Name,Sum,
> Adam,Wallace,19,
> Travis,Smith,10,
> Donald,Barnell,8,
> Gary,Wallace,27,
> James,Smith,10,
> Sam,Johnson,10,
> Travis,Neal,11,
> Adam,Campbell,11,
> Walter,Abbott,13,
> )
>
> Using boxed strings works great for relatively small sets of data. But when
> things get big, their overhead starts to hurt too much. (Big means: so much
> data that you'll probably not be able to fit it all in memory at the same
> time. So you need to plan on relatively frequent delays while reading from
> disk.)
>
> One alternative to boxed strings is segmented strings. A segmented string
> is an argument which could be passed to <;._1. It's basically just a string
> with a prefix delimiter. You can work with these sorts of strings directly,
> and achieve results similar to what you would achieve with boxed arrays.
>
> Segmented strings are a bit clumsier than boxed arrays - you lose a lot of
> the integrity checks, so if you mess up you probably will not see an error.
> So it's probably a good idea to model your code using boxed arrays on a
> small set of data and then convert to segmented representation once you're
> happy with how things work (and once you see a time cost that makes it
> worth spending the time to rework your code).
>
> Also, to avoid having to use f;._2 (or whatever) every time, it's good to
> do an initial pass on the data, to extract its structure.
>
> Here's an example:
>
>    FirstName=:;LF&,each }.0{"1 table
>    LastName=:;LF&,each }.1{"1 table
>    Sum=:;LF&,each }.2{"1 table
>
>    ssdir=: [:(}:,:2-~/\])I.@(= {.),#
>
>    FirstNameDir=: ssdir FirstName
>    LastNameDir=: ssdir LastName
>
> Actually, sum is numeric so let's just use a numeric representation for
> that column
>
>    Sum=: _&".@> }.2{"1 table
>
> Which rows have a last name of Smith?
>
>    <:({.LastNameDir) I. I.'Smith' E. LastName
> 1 4
>
> Actually, there's an assumption there that Smith is not part of some larger
> name. We can include the delimiter in the search if we are concerned about
> that. For even more protection we could append a trailing delimiter on our
> segmented string and then search for (in this case) LF,'Smith',LF.
>
> Anyways, let's extract the corresponding sums and first names:
>
>    1 4{Sum
> 10 10
>
>    FirstName{~;<@(+ i.)/"1|:1 4 {"1 FirstNameDir
>
> Travis
> James
>
> Note that that last expression is a bit complicated. It's not so bad,
> though, if what you are extracting is a small part of the total. And, in
> that case, using a list of indices to express a boolean result seems like a
> good thing. You wind up working with set operations (intersection and
> union) rather than logical operations (and and or). Also, set difference
> instead of logical not (dyadic -. instead of monadic -.).
>
>    intersect=: [ -. -.
>    union=: ~.@,
>
> (It looks like I might be using this kind of thing really soon, so I
> thought I'd lay down my thoughts here and invite comment.)
>
> Thanks,
>
> --
> Raul
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm
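
For anyone who wants to check the directory logic outside J, here is a
rough Python sketch of the segmented-string lookup from the quoted
message. It is an illustration only, not a translation of the J code:
ssdir borrows its name from the original, while rows_matching and
segments are hypothetical helpers made up for this example. It assumes
the first character of each string is the delimiter, and it matches
whole names only (the delimiter-inclusive search suggested in the
quoted text).

```python
from bisect import bisect_right


def ssdir(s):
    """Directory of a prefix-delimited string: (starts, lengths).

    The first character of s is taken as the delimiter. Each start
    points at a delimiter and each length includes that delimiter,
    which mirrors the two rows of the J directory.
    """
    delim = s[0]
    starts = [i for i, ch in enumerate(s) if ch == delim]
    ends = starts[1:] + [len(s)]
    lengths = [e - b for b, e in zip(starts, ends)]
    return starts, lengths


def rows_matching(s, starts, name):
    """Hypothetical helper: row indices whose segment equals name."""
    delim = s[0]
    needle = delim + name
    hits = []
    pos = s.find(needle)
    while pos != -1:
        end = pos + len(needle)
        # Accept only whole-segment matches: the match must be followed
        # by another delimiter or by the end of the string.
        if end == len(s) or s[end] == delim:
            # Map the character position back to a row number.
            hits.append(bisect_right(starts, pos) - 1)
        pos = s.find(needle, pos + 1)
    return hits


def segments(s, starts, lengths, rows):
    """Hypothetical helper: the selected segments, delimiter stripped."""
    return [s[starts[r] + 1 : starts[r] + lengths[r]] for r in rows]


# The same columns as the quoted table, header row dropped.
first_name = "\nAdam\nTravis\nDonald\nGary\nJames\nSam\nTravis\nAdam\nWalter"
last_name = "\nWallace\nSmith\nBarnell\nWallace\nSmith\nJohnson\nNeal\nCampbell\nAbbott"
sums = [19, 10, 8, 27, 10, 10, 11, 11, 13]

fn_starts, fn_lengths = ssdir(first_name)
ln_starts, _ = ssdir(last_name)

smith_rows = rows_matching(last_name, ln_starts, "Smith")
print(smith_rows)                                                # [1, 4]
print([sums[r] for r in smith_rows])                             # [10, 10]
print(segments(first_name, fn_starts, fn_lengths, smith_rows))   # ['Travis', 'James']
```

Run as a script, it reports the same rows, sums and first names as the
J session quoted above. The directory is built once per column and then
reused, which is the point of the initial pass over the data.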
