It's available for free now, with some limitations: http://kx.com/software-download.php
It'll take me a few years, though, to develop a fluency in K (Q
actually, or kdb+ ...) which approaches my fluency in other languages.

Anyways, it's not at all clear that K (or Q or KDB+) would be any better
for this application than J. The grass is always greener on the other
side of the fence, especially after you've crossed it?

Also, if I do my job properly, the language itself becomes irrelevant
and the data structures are straightforward enough to allow any
arbitrary language to be used.

(Meanwhile, I've got J running on OpenBSD, which pleases me.)

Thanks,

--
Raul

On Tue, Apr 8, 2014 at 2:54 PM, km <[email protected]> wrote:
> I think I would pay for k's database capability. --Kip Murray
>
> Sent from my iPad
>
> On Apr 8, 2014, at 12:46 PM, Björn Helgason <[email protected]> wrote:
> >
> > I would take a look at the mapped file database lab to get ideas.
> >
> > -
> > Björn Helgason
> > gsm:6985532
> > skype:gosiminn
> >> On 8.4.2014 15:34, "Raul Miller" <[email protected]> wrote:
> >>
> >> I have thought about using symbols, but the only way to delete
> >> symbols that I know of involves exiting J. And, my starting premise
> >> was that I would have too much data to fit into memory.
> >>
> >> For some computations it does make sense to start up an independent
> >> J session for each part of the calculation (and, in fact, that is
> >> what I am doing in a different aspect of dealing with this dataset -
> >> it's about 10 terabytes, or so I am told - I've not actually seen it
> >> all yet and it takes time to upload it). But for some calculations
> >> you need to be able to correlate between pieces which have been
> >> dealt with elsewhere.
> >>
> >> I have similar reservations about fixed-width fields. There's just
> >> too much data for me to predict how wide the fields are going to be.
> >> In some cases I might actually be going with fixed-width, but that
> >> might be too inefficient for the general case.
> >> I've one field which would have to be over 100k in width if it was
> >> fixed width, even though typical cases are shorter than 1k. At some
> >> point I might go with fixed width, and I expect that doing so will
> >> cause me to lose a few records which will be discovered later in
> >> processing. That might not be a big deal, for this large of a data
> >> set, but if it's not necessary why bother?
> >>
> >> Finally, Bjorn's suggestion of using mapped files does seem like a
> >> good idea, at least for the character data. But that is an
> >> optimization and optimizations speed up some operations at the
> >> expense of slowing down other operations. So what really matters is
> >> the workload.
> >>
> >> Ultimately, for a dataset this large, it's going to take time.
> >>
> >> Thanks,
> >>
> >> --
> >> Raul
> >>
> >>> On Tue, Apr 8, 2014 at 6:06 AM, Joe Bogner <[email protected]> wrote:
> >>>
> >>> It seems this representation is somewhat similar to how the symbol
> >>> table stores strings:
> >>>
> >>> http://m.jsoftware.com/help/dictionary/dsco.htm
> >>>
> >>> Also, did you consider using symbols? I've used symbols for string
> >>> columns that contain highly repetitive data, for example, an
> >>> invoice table with an alpha-numeric SKU.
> >>>
> >>> Thanks for sharing
> >>>
> >>> On Tue, Apr 8, 2014 at 2:40 AM, Raul Miller <[email protected]> wrote:
> >>>
> >>>> Consider this example:
> >>>>
> >>>>    table=:<;._2;._2]0 :0
> >>>> First Name,Last Name,Sum,
> >>>> Adam,Wallace,19,
> >>>> Travis,Smith,10,
> >>>> Donald,Barnell,8,
> >>>> Gary,Wallace,27,
> >>>> James,Smith,10,
> >>>> Sam,Johnson,10,
> >>>> Travis,Neal,11,
> >>>> Adam,Campbell,11,
> >>>> Walter,Abbott,13,
> >>>> )
> >>>>
> >>>> Using boxed strings works great for relatively small sets of data.
> >>>> But when things get big, their overhead starts to hurt too much.
> >>>> (Big means: so much data that you'll probably not be able to fit
> >>>> it all in memory at the same time. So you need to plan on
> >>>> relatively frequent delays while reading from disk.)
> >>>>
> >>>> One alternative to boxed strings is segmented strings. A segmented
> >>>> string is an argument which could be passed to <;._1. It's
> >>>> basically just a string with a prefix delimiter. You can work with
> >>>> these sorts of strings directly, and achieve results similar to
> >>>> what you would achieve with boxed arrays.
> >>>>
> >>>> Segmented strings are a bit clumsier than boxed arrays - you lose
> >>>> a lot of the integrity checks, so if you mess up you probably will
> >>>> not see an error. So it's probably a good idea to model your code
> >>>> using boxed arrays on a small set of data and then convert to
> >>>> segmented representation once you're happy with how things work
> >>>> (and once you see a time cost that makes it worth spending the
> >>>> time to rework your code).
> >>>>
> >>>> Also, to avoid having to use f;._2 (or whatever) every time, it's
> >>>> good to do an initial pass on the data, to extract its structure.
> >>>>
> >>>> Here's an example:
> >>>>
> >>>>    FirstName=:;LF&,each }.0{"1 table
> >>>>    LastName=:;LF&,each }.1{"1 table
> >>>>    Sum=:;LF&,each }.2{"1 table
> >>>>
> >>>>    ssdir=: [:(}:,:2-~/\])I.@(= {.),#
> >>>>
> >>>>    FirstNameDir=: ssdir FirstName
> >>>>    LastNameDir=: ssdir LastName
> >>>>
> >>>> Actually, sum is numeric so let's just use a numeric
> >>>> representation for that column
> >>>>
> >>>>    Sum=: _&".@> }.2{"1 table
> >>>>
> >>>> Which rows have a last name of Smith?
> >>>>
> >>>>    <:({.LastNameDir) I. I.'Smith' E. LastName
> >>>> 1 4
> >>>>
> >>>> Actually, there's an assumption there that Smith is not part of
> >>>> some larger name. We can include the delimiter in the search if we
> >>>> are concerned about that.
> >>>> For even more protection we could append a trailing delimiter on
> >>>> our segmented string and then search for (in this case)
> >>>> LF,'Smith',LF.
> >>>>
> >>>> Anyways, let's extract the corresponding sums and first name:
> >>>>
> >>>>    1 4{Sum
> >>>> 10 10
> >>>>
> >>>>    FirstName{~;<@(+ i.)/"1|:1 4 {"1 FirstNameDir
> >>>>
> >>>> Travis
> >>>> James
> >>>>
> >>>> Note that that last expression is a bit complicated. It's not so
> >>>> bad, though, if what you are extracting is a small part of the
> >>>> total. And, in that case, using a list of indices to express a
> >>>> boolean result seems like a good thing. You wind up working with
> >>>> set operations (intersection and union) rather than logical
> >>>> operations (and and or). Also, set difference instead of logical
> >>>> not (dyadic -. instead of monadic -.).
> >>>>
> >>>>    intersect=: [ -. -.
> >>>>    union=: ~.@,
> >>>>
> >>>> (It looks like I might be using this kind of thing really soon, so
> >>>> I thought I'd lay down my thoughts here and invite comment.)
> >>>>
> >>>> Thanks,
> >>>>
> >>>> --
> >>>> Raul
> >>>> ----------------------------------------------------------------------
> >>>> For information about J forums see http://www.jsoftware.com/forums.htm
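
For what it's worth, here is a minimal sketch of the delimiter-bracketed search mentioned in the quoted message (append a trailing delimiter, then search for LF,'Smith',LF). ssdir is copied from the quoted message; the shortened five-row LastName list and the names pat and hits are made up for illustration and are not from the thread. Treat it as an untested-at-scale sketch, not a definitive implementation.

```j
NB. ssdir is taken verbatim (respaced) from the quoted message:
NB. row 0 = offset of each prefix delimiter, row 1 = segment lengths
ssdir=: [: (}: ,: 2 -~/\ ]) I.@(= {.) , #

NB. Assumption: a shortened, hand-built LastName column for illustration
LastName=: LF,'Wallace',LF,'Smith',LF,'Barnell',LF,'Wallace',LF,'Smith'
LastNameDir=: ssdir LastName

NB. Hypothetical names: pat is the name bracketed by delimiters,
NB. hits are the offsets where pat starts in the string plus a trailing LF
pat=: LF,'Smith',LF
hits=: I. pat E. LastName,LF

NB. Each hit sits on a prefix delimiter, so it equals a directory offset
NB. and interval index gives the row number without the <: decrement
({.LastNameDir) I. hits     NB. 1 4
```

With this pattern a longer name that merely begins or ends with Smith would not match, since the delimiter has to appear on both sides.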
