I would take a look at the mapped file database lab to get ideas.

-
Björn Helgason
gsm:6985532
skype:gosiminn

On 8.4.2014 15:34, "Raul Miller" <[email protected]> wrote:
> I have thought about using symbols, but the only way to delete symbols that
> I know of involves exiting J. And, my starting premise was that I would
> have too much data to fit into memory.
>
> For some computations it does make sense to start up an independent J
> session for each part of the calculation (and, in fact, that is what I am
> doing in a different aspect of dealing with this dataset - it's about 10
> terabytes, or so I am told - I've not actually seen it all yet and it takes
> time to upload it). But for some calculations you need to be able to
> correlate between pieces which have been dealt with elsewhere.
>
> I have similar reservations about fixed-width fields. There's just too much
> data for me to predict how wide the fields are going to be. In some cases I
> might actually be going with fixed-width, but that might be too inefficient
> for the general case. I've one field which would have to be over 100k in
> width if it was fixed width, even though typical cases are shorter than 1k.
> At some point I might go with fixed width, and I expect that doing so will
> cause me to lose a few records which will be discovered later in
> processing. That might not be a big deal, for this large of a data set, but
> if it's not necessary why bother?
>
> Finally, Bjorn's suggestion of using mapped files does seem like a good
> idea, at least for the character data. But that is an optimization and
> optimizations speed up some operations at the expense of slowing down other
> operations. So what really matters is the workload.
>
> Ultimately, for a dataset this large, it's going to take time.
>
> Thanks,
>
> --
> Raul
>
> On Tue, Apr 8, 2014 at 6:06 AM, Joe Bogner <[email protected]> wrote:
>
> > It seems this representation is somewhat similar to how the symbol table
> > stores strings:
> >
> > http://m.jsoftware.com/help/dictionary/dsco.htm
> >
> > Also, did you consider using symbols?
> > I've used symbols for string columns that contain highly repetitive
> > data, for example, an invoice table with an alpha-numeric SKU.
> >
> > Thanks for sharing
> >
> > On Tue, Apr 8, 2014 at 2:40 AM, Raul Miller <[email protected]> wrote:
> >
> > > Consider this example:
> > >
> > >    table=:<;._2;._2]0 :0
> > > First Name,Last Name,Sum,
> > > Adam,Wallace,19,
> > > Travis,Smith,10,
> > > Donald,Barnell,8,
> > > Gary,Wallace,27,
> > > James,Smith,10,
> > > Sam,Johnson,10,
> > > Travis,Neal,11,
> > > Adam,Campbell,11,
> > > Walter,Abbott,13,
> > > )
> > >
> > > Using boxed strings works great for relatively small sets of data. But
> > > when things get big, their overhead starts to hurt too much. (Big
> > > means: so much data that you'll probably not be able to fit it all in
> > > memory at the same time. So you need to plan on relatively frequent
> > > delays while reading from disk.)
> > >
> > > One alternative to boxed strings is segmented strings. A segmented
> > > string is an argument which could be passed to <;._1. It's basically
> > > just a string with a prefix delimiter. You can work with these sorts of
> > > strings directly, and achieve results similar to what you would achieve
> > > with boxed arrays.
> > >
> > > Segmented strings are a bit clumsier than boxed arrays - you lose a lot
> > > of the integrity checks, so if you mess up you probably will not see an
> > > error. So it's probably a good idea to model your code using boxed
> > > arrays on a small set of data and then convert to segmented
> > > representation once you're happy with how things work (and once you see
> > > a time cost that makes it worth spending the time to rework your code).
> > >
> > > Also, to avoid having to use f;._2 (or whatever) every time, it's good
> > > to do an initial pass on the data, to extract its structure.
> > >
> > > Here's an example:
> > >
> > >    FirstName=:;LF&,each }.0{"1 table
> > >    LastName=:;LF&,each }.1{"1 table
> > >    Sum=:;LF&,each }.2{"1 table
> > >
> > >    ssdir=: [:(}:,:2-~/\])I.@(= {.),#
> > >
> > >    FirstNameDir=: ssdir FirstName
> > >    LastNameDir=: ssdir LastName
> > >
> > > Actually, sum is numeric so let's just use a numeric representation for
> > > that column
> > >
> > >    Sum=: _&".@> }.2{"1 table
> > >
> > > Which rows have a last name of Smith?
> > >
> > >    <:({.LastNameDir) I. I.'Smith' E. LastName
> > > 1 4
> > >
> > > Actually, there's an assumption there that Smith is not part of some
> > > larger name. We can include the delimiter in the search if we are
> > > concerned about that. For even more protection we could append a
> > > trailing delimiter on our segmented string and then search for (in this
> > > case) LF,'Smith',LF.
> > >
> > > Anyways, let's extract the corresponding sums and first name:
> > >
> > >    1 4{Sum
> > > 10 10
> > >
> > >    FirstName{~;<@(+ i.)/"1|:1 4 {"1 FirstNameDir
> > >
> > > Travis
> > > James
> > >
> > > Note that that last expression is a bit complicated. It's not so bad,
> > > though, if what you are extracting is a small part of the total. And,
> > > in that case, using a list of indices to express a boolean result seems
> > > like a good thing. You wind up working with set operations
> > > (intersection and union) rather than logical operations (and and or).
> > > Also, set difference instead of logical not (dyadic -. instead of
> > > monadic -.).
> > >
> > >    intersect=: [ -. -.
> > >    union=: ~.@,
> > >
> > > (It looks like I might be using this kind of thing really soon, so I
> > > thought I'd lay down my thoughts here and invite comment.)
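The exact-match search suggested above (look for the name with a delimiter on both sides) might be sketched roughly like this. It is only a sketch: rowsOf is a hypothetical helper name that does not appear in the thread, and it assumes ssdir, intersect, FirstName and LastName are defined exactly as in the quoted message.

```j
NB. rowsOf is a hypothetical helper, not part of the original message.
NB. x: the value to look for
NB. y: a segmented string whose first character is its delimiter
NB. result: indices of the segments that are exactly equal to x
rowsOf=: 4 : 0
d=. {. y                   NB. the delimiter (LF in the examples above)
hits=. I. (d,x,d) E. y,d   NB. every hit begins on a leading delimiter
({. ssdir y) i. hits       NB. map segment start offsets to row numbers
)

'Smith' rowsOf LastName    NB. 1 4 for the sample table
('Adam' rowsOf FirstName) intersect 'Wallace' rowsOf LastName   NB. 0
```

Because the search pattern starts on a delimiter, each hit is itself a segment start, so a plain index-of against the first row of the directory is enough here and the decrement after interval index is not needed. A name such as Smithson would no longer match a search for Smith.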
> > >
> > > Thanks,
> > >
> > > --
> > > Raul
> > > ----------------------------------------------------------------------
> > > For information about J forums see http://www.jsoftware.com/forums.htm
