How?

Thanks,

--
Raul

On Wed, Apr 9, 2014 at 3:44 AM, Linda Alvord <[email protected]> wrote:

> I would not get rid of your table made of strings. I would access it in
> the form of J tables because that is what J does nicely.
>
> Linda
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]]
> On Behalf Of Raul Miller
> Sent: Wednesday, April 09, 2014 2:48 AM
> To: Programming forum
> Subject: Re: [Jprogramming] "Segmented Strings"
>
> The plan is that segmented strings are the data in the database.
>
> There's just too much information to hold it all in memory on a single
> machine.
>
> Thanks,
>
> --
> Raul
>
> On Wed, Apr 9, 2014 at 2:23 AM, Linda Alvord <[email protected]> wrote:
>
> > I know almost nothing about large databases, but what is the advantage of
> > staying with sstrings after the data base is built?
> >
> > Once you have your table, or maybe two or more tables of character and
> > numeric data, you might "stay in J" and make "subtables" which can be
> > catenated together and destroyed as needed. You could also do selections
> > of subsets more easily.
> >
> >    ]FirstName=:;LF&,each }.0{"1 table
> >
> > Adam
> > Travis
> > Donald
> > Gary
> > James
> > Sam
> > Travis
> > Adam
> > Walter
> >
> >    ]FN2=: >"0 }.0{"1 table
> > Adam
> > Travis
> > Donald
> > Gary
> > James
> > Sam
> > Travis
> > Adam
> > Walter
> >
> >    FN2-:FirstName
> > 0
> >    $FirstName
> > 53
> >    $FN2
> > 9 6
> >
> > Linda
> >
> > -----Original Message-----
> > From: [email protected] [mailto:[email protected]]
> > On Behalf Of Raul Miller
> > Sent: Tuesday, April 08, 2014 8:22 PM
> > To: Programming forum
> > Subject: Re: [Jprogramming] "Segmented Strings"
> >
> > I might indeed do that, but in some cases the time to read the file itself
> > will be mostly network transfer time. And, once it's in memory, how it got
> > there isn't really an issue.
> >
> > Still, it's worth benchmarking.
> >
> > Thanks,
> >
> > --
> > Raul
> >
> > On Tue, Apr 8, 2014 at 8:18 PM, Vijay Lulla <[email protected]> wrote:
> >
> > > I second memory mapped files and mapped file database.
> > >
> > > On Tue, Apr 8, 2014 at 4:51 PM, Raul Miller <[email protected]> wrote:
> > >
> > > > It's available for free now, with some limitations:
> > > >
> > > > http://kx.com/software-download.php
> > > >
> > > > It'll take me a few years, though, to develop a fluency in K (Q actually,
> > > > or kdb+ ...) which approaches my fluency in other languages. Anyways, it's
> > > > not at all clear that K (or Q or KDB+) would be any better for this
> > > > application than J. The grass is always greener on the other side of the
> > > > fence, especially after you've crossed it?
> > > >
> > > > Also, if I do my job properly, the language itself becomes irrelevant and
> > > > the data structures are straightforward enough to allow any arbitrary
> > > > language to be used.
> > > >
> > > > (Meanwhile, I've got J running on OpenBSD, which pleases me.)
> > > >
> > > > --
> > > > Raul
> > > >
> > > > Thanks,
> > > >
> > > > --
> > > > Raul
> > > >
> > > > On Tue, Apr 8, 2014 at 2:54 PM, km <[email protected]> wrote:
> > > >
> > > > > I think I would pay for k's database capability. --Kip Murray
> > > > >
> > > > > Sent from my iPad
> > > > >
> > > > > > On Apr 8, 2014, at 12:46 PM, Björn Helgason <[email protected]> wrote:
> > > > > >
> > > > > > I would take a look at the mapped file database lab to get ideas.
> > > > > >
> > > > > > -
> > > > > > Björn Helgason
> > > > > > gsm:6985532
> > > > > > skype:gosiminn
> > > > > >> On 8.4.2014 15:34, "Raul Miller" <[email protected]> wrote:
> > > > > >>
> > > > > >> I have thought about using symbols, but the only way to delete symbols that
> > > > > >> I know of involves exiting J. And, my starting premise was that I would
> > > > > >> have too much data to fit into memory.
> > > > > >>
> > > > > >> For some computations it does make sense to start up an independent J
> > > > > >> session for each part of the calculation (and, in fact, that is what I am
> > > > > >> doing in a different aspect of dealing with this dataset - it's about 10
> > > > > >> terabytes, or so I am told - I've not actually seen it all yet and it takes
> > > > > >> time to upload it). But for some calculations you need to be able to
> > > > > >> correlate between pieces which have been dealt with elsewhere.
> > > > > >>
> > > > > >> I have similar reservations about fixed-width fields. There's just too much
> > > > > >> data for me to predict how wide the fields are going to be. In some cases I
> > > > > >> might actually be going with fixed-width, but that might be too inefficient
> > > > > >> for the general case. I've one field which would have to be over 100k in
> > > > > >> width if it was fixed width, even though typical cases are shorter than 1k.
> > > > > >> At some point I might go with fixed width, and I expect that doing so will
> > > > > >> cause me to lose a few records which will be discovered later in
> > > > > >> processing. That might not be a big deal, for this large of a data set, but
> > > > > >> if it's not necessary why bother?
> > > > > >>
> > > > > >> Finally, Bjorn's suggestion of using mapped files does seem like a good
> > > > > >> idea, at least for the character data. But that is an optimization and
> > > > > >> optimizations speed up some operations at the expense of slowing down other
> > > > > >> operations. So what really matters is the workload.
> > > > > >>
> > > > > >> Ultimately, for a dataset this large, it's going to take time.
> > > > > >>
> > > > > >> Thanks,
> > > > > >>
> > > > > >> --
> > > > > >> Raul
> > > > > >>
> > > > > >>> On Tue, Apr 8, 2014 at 6:06 AM, Joe Bogner <[email protected]> wrote:
> > > > > >>>
> > > > > >>> It seems this representation is somewhat similar to how the symbol table
> > > > > >>> stores strings:
> > > > > >>>
> > > > > >>> http://m.jsoftware.com/help/dictionary/dsco.htm
> > > > > >>>
> > > > > >>> Also, did you consider using symbols? I've used symbols for string columns
> > > > > >>> that contain highly repetitive data, for example, an invoice table with an
> > > > > >>> alpha-numeric SKU.
> > > > > >>>
> > > > > >>> Thanks for sharing
> > > > > >>>
> > > > > >>> On Tue, Apr 8, 2014 at 2:40 AM, Raul Miller <[email protected]> wrote:
> > > > > >>>
> > > > > >>>> Consider this example:
> > > > > >>>>
> > > > > >>>> table=:<;._2;._2]0 :0
> > > > > >>>> First Name,Last Name,Sum,
> > > > > >>>> Adam,Wallace,19,
> > > > > >>>> Travis,Smith,10,
> > > > > >>>> Donald,Barnell,8,
> > > > > >>>> Gary,Wallace,27,
> > > > > >>>> James,Smith,10,
> > > > > >>>> Sam,Johnson,10,
> > > > > >>>> Travis,Neal,11,
> > > > > >>>> Adam,Campbell,11,
> > > > > >>>> Walter,Abbott,13,
> > > > > >>>> )
> > > > > >>>>
> > > > > >>>> Using boxed strings works great for relatively small sets of data. But when
> > > > > >>>> things get big, their overhead starts to hurt too much. (Big means: so much
> > > > > >>>> data that you'll probably not be able to fit it all in memory at the same
> > > > > >>>> time. So you need to plan on relatively frequent delays while reading from
> > > > > >>>> disk.)
> > > > > >>>>
> > > > > >>>> One alternative to boxed strings is segmented strings.
> > > > > >>>> A segmented string is an argument which could be passed to <;._1. It's
> > > > > >>>> basically just a string with a prefix delimiter. You can work with these
> > > > > >>>> sorts of strings directly, and achieve results similar to what you would
> > > > > >>>> achieve with boxed arrays.
> > > > > >>>>
> > > > > >>>> Segmented strings are a bit clumsier than boxed arrays - you lose a lot of
> > > > > >>>> the integrity checks, so if you mess up you probably will not see an error.
> > > > > >>>> So it's probably a good idea to model your code using boxed arrays on a
> > > > > >>>> small set of data and then convert to segmented representation once you're
> > > > > >>>> happy with how things work (and once you see a time cost that makes it
> > > > > >>>> worth spending the time to rework your code).
> > > > > >>>>
> > > > > >>>> Also, to avoid having to use f;._2 (or whatever) every time, it's good to
> > > > > >>>> do an initial pass on the data, to extract its structure.
> > > > > >>>>
> > > > > >>>> Here's an example:
> > > > > >>>>
> > > > > >>>> FirstName=:;LF&,each }.0{"1 table
> > > > > >>>>
> > > > > >>>> LastName=:;LF&,each }.1{"1 table
> > > > > >>>>
> > > > > >>>> Sum=:;LF&,each }.2{"1 table
> > > > > >>>>
> > > > > >>>> ssdir=: [:(}:,:2-~/\])I.@(= {.),#
> > > > > >>>>
> > > > > >>>> FirstNameDir=: ssdir FirstName
> > > > > >>>> LastNameDir=: ssdir LastName
> > > > > >>>>
> > > > > >>>> Actually, sum is numeric so let's just use a numeric representation for
> > > > > >>>> that column
> > > > > >>>>
> > > > > >>>> Sum=: _&".@> }.2{"1 table
> > > > > >>>>
> > > > > >>>> Which rows have a last name of Smith?
> > > > > >>>>
> > > > > >>>> <:({.LastNameDir) I. I.'Smith' E. LastName
> > > > > >>>>
> > > > > >>>> 1 4
> > > > > >>>>
> > > > > >>>> Actually, there's an assumption there that Smith is not part of some larger
> > > > > >>>> name. We can include the delimiter in the search if we are concerned about
> > > > > >>>> that. For even more protection we could append a trailing delimiter on our
> > > > > >>>> segmented string and then search for (in this case) LF,'Smith',LF.
> > > > > >>>>
> > > > > >>>> Anyways, let's extract the corresponding sums and first names:
> > > > > >>>>
> > > > > >>>> 1 4{Sum
> > > > > >>>>
> > > > > >>>> 10 10
> > > > > >>>>
> > > > > >>>> FirstName{~;<@(+ i.)/"1|:1 4 {"1 FirstNameDir
> > > > > >>>>
> > > > > >>>> Travis
> > > > > >>>>
> > > > > >>>> James
> > > > > >>>>
> > > > > >>>> Note that that last expression is a bit complicated. It's not so bad,
> > > > > >>>> though, if what you are extracting is a small part of the total. And, in
> > > > > >>>> that case, using a list of indices to express a boolean result seems like a
> > > > > >>>> good thing. You wind up working with set operations (intersection and
> > > > > >>>> union) rather than logical operations (and and or). Also, set difference
> > > > > >>>> instead of logical not (dyadic -. instead of monadic -.).
> > > > > >>>>
> > > > > >>>> intersect=: [ -. -.
> > > > > >>>>
> > > > > >>>> union=: ~.@,
> > > > > >>>>
> > > > > >>>> (It looks like I might be using this kind of thing really soon, so I
> > > > > >>>> thought I'd lay down my thoughts here and invite comment.)
> > > > > >>>>
> > > > > >>>> Thanks,
> > > > > >>>>
> > > > > >>>> --
> > > > > >>>> Raul
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
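
A note on the exact-match search mentioned in the original post (include the delimiter in the search, append a trailing delimiter, then look for LF,'Smith',LF): a minimal sketch in J might look like the following. The verb name ssfind and the small inline LastName and Sum data are assumptions made up for illustration, not something posted in the thread; intersect is the definition from the original post.

```j
NB. Sketch only. ssfind is a hypothetical helper name, not from the thread.
NB. A segmented string is a character list whose first item is its delimiter.

NB. Sample data (assumed here so the block runs on its own).
LastName=: ; LF&,each 'Wallace';'Smith';'Barnell';'Wallace';'Smith';'Johnson';'Neal';'Campbell';'Abbott'
Sum=: 19 10 8 27 10 10 11 11 13

NB. x ssfind y : row numbers in segmented string x whose segment is exactly y.
NB. Appends a trailing delimiter to x, searches for delimiter,y,delimiter,
NB. then looks up each hit position among the segment start positions.
ssfind=: 4 : '(I. x = {. x) i. I. (({.x),y,{.x) E. x,{.x'

NB. Set intersection on index lists, as defined in the original post.
intersect=: [ -. -.

LastName ssfind 'Smith'                            NB. 1 4
LastName ssfind 'Smit'                             NB. empty: partial names do not match
(LastName ssfind 'Smith') intersect I. Sum = 10    NB. 1 4
```

Because each hit is the position of the delimiter that opens a segment, the row number comes from a plain index-of lookup rather than an interval search, so no decrement is needed. Whether this is faster than the interval-index form on a large column is untested here.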
