Your example FirstName=:;LF&,each }.0{"1 table creates a string.
Mine, ]FN2=: >"0 }.0{"1 table, creates a table.
If you create tables of the character data and tables of the numeric data
separately, you could transform the numeric data and then join columns to
columns or rows to rows.
More dimensions could be created as well and then joined in ways that
summarize the useful data, and finally the results could be rejoined.
My suggestion is really only about giving thought to how best to extract
and use the string table you have created.
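
A minimal sketch of the idea, using a small hypothetical subset of the data (the names names, sums, doubled and report are made up for illustration):

```j
NB. Hypothetical small data, mirroring the first rows of the example table.
names=: >'Adam';'Travis';'Donald'        NB. 3 6 character table, blank padded
sums=: 19 10 8                           NB. numeric column kept separately
doubled=: 2 * sums                       NB. transform the numeric data alone
report=: names ,. ' ' ,. ": ,. doubled   NB. join columns to columns as text
more=: names , >'Gary';'James'           NB. join rows to rows (5 6 table)
```

Here ,. stitches columns side by side and , appends rows, padding the shorter rows with blanks.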
Linda
-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of Raul Miller
Sent: Wednesday, April 09, 2014 10:06 AM
To: Programming forum
Subject: Re: [Jprogramming] "Segmented Strings"
How?
Thanks,
--
Raul
On Wed, Apr 9, 2014 at 3:44 AM, Linda Alvord <[email protected]> wrote:
> I would not get rid of your table made of strings. I would access it in
> the form of J tables because that is what J does nicely.
>
> Linda
>
> -----Original Message-----
> From: [email protected] [mailto:
> [email protected]] On Behalf Of Raul Miller
> Sent: Wednesday, April 09, 2014 2:48 AM
> To: Programming forum
> Subject: Re: [Jprogramming] "Segmented Strings"
>
> The plan is that segmented strings are the data in the database.
>
> There's just too much information to hold it all in memory on a single
> machine.
>
> Thanks,
>
> --
> Raul
>
>
> On Wed, Apr 9, 2014 at 2:23 AM, Linda Alvord <[email protected]> wrote:
>
> > I know almost nothing about large databases, but what is the advantage of
> > staying with segmented strings after the database is built?
> >
> > Once you have your table, or maybe two or more tables of character and
> > numeric data, you might stay in J and make "subtables" which can be
> > catenated together and destroyed as needed. You could also do selections
> > of subsets more easily.
> >
> > ]FirstName=:;LF&,each }.0{"1 table
> >
> > Adam
> > Travis
> > Donald
> > Gary
> > James
> > Sam
> > Travis
> > Adam
> > Walter
> >
> > ]FN2=: >"0 }.0{"1 table
> > Adam
> > Travis
> > Donald
> > Gary
> > James
> > Sam
> > Travis
> > Adam
> > Walter
> >
> > FN2-:FirstName
> > 0
> > $FirstName
> > 53
> > $FN2
> > 9 6
> >
> > Linda
> >
> >
> > -----Original Message-----
> > From: [email protected] [mailto:
> > [email protected]] On Behalf Of Raul Miller
> > Sent: Tuesday, April 08, 2014 8:22 PM
> > To: Programming forum
> > Subject: Re: [Jprogramming] "Segmented Strings"
> >
> > I might indeed do that, but in some cases the time to read the file itself
> > will be mostly network transfer time. And, once it's in memory, how it got
> > there isn't really an issue.
> >
> > Still, it's worth benchmarking.
> >
> > Thanks,
> >
> > --
> > Raul
> >
> >
> > On Tue, Apr 8, 2014 at 8:18 PM, Vijay Lulla <[email protected]> wrote:
> >
> > > I second memory mapped files and mapped file database.
> > >
> > >
> > > On Tue, Apr 8, 2014 at 4:51 PM, Raul Miller <[email protected]> wrote:
> > >
> > > > It's available for free now, with some limitations:
> > > >
> > > > http://kx.com/software-download.php
> > > >
> > > > It'll take me a few years, though, to develop a fluency in K (Q actually,
> > > > or kdb+ ...) which approaches my fluency in other languages. Anyways, it's
> > > > not at all clear that K (or Q or kdb+) would be any better for this
> > > > application than J. The grass is always greener on the other side of the
> > > > fence, especially after you've crossed it?
> > > >
> > > > Also, if I do my job properly, the language itself becomes irrelevant and
> > > > the data structures are straightforward enough to allow any arbitrary
> > > > language to be used.
> > > >
> > > > (Meanwhile, I've got J running on OpenBSD, which pleases me.)
> > > >
> > > > Thanks,
> > > >
> > > > --
> > > > Raul
> > > >
> > > >
> > > > On Tue, Apr 8, 2014 at 2:54 PM, km <[email protected]> wrote:
> > > >
> > > > > I think I would pay for k's database capability. --Kip Murray
> > > > >
> > > > > Sent from my iPad
> > > > >
> > > > > > > On Apr 8, 2014, at 12:46 PM, Björn Helgason <[email protected]> wrote:
> > > > > >
> > > > > > I would take a look at the mapped file database lab to get ideas.
> > > > > >
> > > > > > -
> > > > > > Björn Helgason
> > > > > > gsm:6985532
> > > > > > skype:gosiminn
> > > > > >> On 8.4.2014 15:34, "Raul Miller" <[email protected]> wrote:
> > > > > >>
> > > > > > >> I have thought about using symbols, but the only way to delete symbols that
> > > > > > >> I know of involves exiting J. And, my starting premise was that I would
> > > > > > >> have too much data to fit into memory.
> > > > > > >>
> > > > > > >> For some computations it does make sense to start up an independent J
> > > > > > >> session for each part of the calculation (and, in fact, that is what I am
> > > > > > >> doing in a different aspect of dealing with this dataset - it's about 10
> > > > > > >> terabytes, or so I am told - I've not actually seen it all yet and it takes
> > > > > > >> time to upload it). But for some calculations you need to be able to
> > > > > > >> correlate between pieces which have been dealt with elsewhere.
> > > > > > >>
> > > > > > >> I have similar reservations about fixed-width fields. There's just too much
> > > > > > >> data for me to predict how wide the fields are going to be. In some cases I
> > > > > > >> might actually be going with fixed-width, but that might be too inefficient
> > > > > > >> for the general case. I've one field which would have to be over 100k in
> > > > > > >> width if it was fixed width, even though typical cases are shorter than 1k.
> > > > > > >> At some point I might go with fixed width, and I expect that doing so will
> > > > > > >> cause me to lose a few records which will be discovered later in
> > > > > > >> processing. That might not be a big deal, for this large of a data set, but
> > > > > > >> if it's not necessary why bother?
> > > > > > >>
> > > > > > >> Finally, Bjorn's suggestion of using mapped files does seem like a good
> > > > > > >> idea, at least for the character data. But that is an optimization and
> > > > > > >> optimizations speed up some operations at the expense of slowing down other
> > > > > > >> operations. So what really matters is the workload.
> > > > > > >>
> > > > > > >> Ultimately, for a dataset this large, it's going to take time.
> > > > > >>
> > > > > >> Thanks,
> > > > > >>
> > > > > >> --
> > > > > >> Raul
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > > >>> On Tue, Apr 8, 2014 at 6:06 AM, Joe Bogner <[email protected]> wrote:
> > > > > >>>
> > > > > > >>> It seems this representation is somewhat similar to how the symbol table
> > > > > > >>> stores strings:
> > > > > > >>>
> > > > > > >>> http://m.jsoftware.com/help/dictionary/dsco.htm
> > > > > > >>>
> > > > > > >>> Also, did you consider using symbols? I've used symbols for string columns
> > > > > > >>> that contain highly repetitive data, for example, an invoice table with an
> > > > > > >>> alpha-numeric SKU.
> > > > > >>>
> > > > > >>> Thanks for sharing
> > > > > >>>
> > > > > >>>
> > > > > >>>
> > > > > >>>
> > > > > >>>
> > > > > >>>
> > > > > > >>> On Tue, Apr 8, 2014 at 2:40 AM, Raul Miller <[email protected]> wrote:
> > > > > >>>
> > > > > >>>> Consider this example:
> > > > > >>>>
> > > > > >>>> table=:<;._2;._2]0 :0
> > > > > >>>> First Name,Last Name,Sum,
> > > > > >>>> Adam,Wallace,19,
> > > > > >>>> Travis,Smith,10,
> > > > > >>>> Donald,Barnell,8,
> > > > > >>>> Gary,Wallace,27,
> > > > > >>>> James,Smith,10,
> > > > > >>>> Sam,Johnson,10,
> > > > > >>>> Travis,Neal,11,
> > > > > >>>> Adam,Campbell,11,
> > > > > >>>> Walter,Abbott,13,
> > > > > >>>> )
> > > > > >>>>
> > > > > > >>>> Using boxed strings works great for relatively small sets of data. But when
> > > > > > >>>> things get big, their overhead starts to hurt too much. (Big means: so much
> > > > > > >>>> data that you'll probably not be able to fit it all in memory at the same
> > > > > > >>>> time. So you need to plan on relatively frequent delays while reading from
> > > > > > >>>> disk.)
> > > > > > >>>>
> > > > > > >>>> One alternative to boxed strings is segmented strings. A segmented string
> > > > > > >>>> is an argument which could be passed to <;._1. It's basically just a string
> > > > > > >>>> with a prefix delimiter. You can work with these sorts of strings directly,
> > > > > > >>>> and achieve results similar to what you would achieve with boxed arrays.
> > > > > > >>>>
> > > > > > >>>> Segmented strings are a bit clumsier than boxed arrays - you lose a lot of
> > > > > > >>>> the integrity checks, so if you mess up you probably will not see an error.
> > > > > > >>>> So it's probably a good idea to model your code using boxed arrays on a
> > > > > > >>>> small set of data and then convert to segmented representation once you're
> > > > > > >>>> happy with how things work (and once you see a time cost that makes it
> > > > > > >>>> worth spending the time to rework your code).
> > > > > > >>>>
> > > > > > >>>> Also, to avoid having to use f;._2 (or whatever) every time, it's good to
> > > > > > >>>> do an initial pass on the data, to extract its structure.
> > > > > >>>>
> > > > > >>>> Here's an example:
> > > > > >>>>
> > > > > >>>> FirstName=:;LF&,each }.0{"1 table
> > > > > >>>>
> > > > > >>>> LastName=:;LF&,each }.1{"1 table
> > > > > >>>>
> > > > > >>>> Sum=:;LF&,each }.2{"1 table
> > > > > >>>>
> > > > > >>>>
> > > > > >>>> ssdir=: [:(}:,:2-~/\])I.@(= {.),#
> > > > > >>>>
> > > > > >>>> FirstNameDir=: ssdir FirstName
> > > > > >>>> LastNameDir=: ssdir LastName
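
To make the directory layout concrete, here is a small sketch of what ssdir returns. The string s is a hypothetical three-item example, not data from the thread:

```j
ssdir=: [:(}:,:2-~/\])I.@(= {.),#      NB. as defined in the quoted message
s=: LF,'Adam',LF,'Travis',LF,'Sam'     NB. hypothetical segmented string
dir=: ssdir s
NB. row 0 of dir: offset of each leading delimiter     (0 5 12)
NB. row 1 of dir: segment length, delimiter included   (5 7 4)
second=: s {~ (+ i.)/ 1 {"1 dir        NB. LF,'Travis' pulled out by offset and length
```

The lengths in row 1 count the delimiter, so indexing with an offset and length pair returns the segment with its leading LF still attached.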
> > > > > >>>>
> > > > > > >>>> Actually, sum is numeric so let's just use a numeric representation for
> > > > > > >>>> that column
> > > > > >>>>
> > > > > >>>> Sum=: _&".@> }.2{"1 table
> > > > > >>>>
> > > > > >>>> Which rows have a last name of Smith?
> > > > > >>>>
> > > > > >>>> <:({.LastNameDir) I. I.'Smith' E. LastName
> > > > > >>>>
> > > > > >>>> 1 4
> > > > > >>>>
> > > > > >>>>
> > > > > > >>>> Actually, there's an assumption there that Smith is not part of some larger
> > > > > > >>>> name. We can include the delimiter in the search if we are concerned about
> > > > > > >>>> that. For even more protection we could append a trailing delimiter on our
> > > > > > >>>> segmented string and then search for (in this case) LF,'Smith',LF.
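
A rough sketch of that delimiter-bracketed search. The names list, pat, starts, hits and rows are hypothetical stand-ins, and the 'Smithson' entry is added only to show the false match being avoided:

```j
NB. Hypothetical stand-in for the LastName segmented string.
LastName=: ; LF&, each 'Wallace';'Smith';'Barnell';'Wallace';'Smith';'Smithson'
starts=: I. LastName = {. LastName   NB. offset of each leading delimiter
pat=: LF,'Smith',LF                  NB. the name bracketed by delimiters
hits=: I. pat E. LastName,LF         NB. search with a trailing delimiter appended
rows=: starts i. hits                NB. 1 4 here; 'Smithson' is not matched
```

Because every hit now begins on a delimiter, the hit offsets are themselves segment starts, so a plain lookup with i. gives the row numbers and the decrement used with I. in the unbracketed search is not needed.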
> > > > > >>>>
> > > > > >>>>
> > > > > >>>> Anyways, let's extract the corresponding sums and first name:
> > > > > >>>>
> > > > > >>>>
> > > > > >>>> 1 4{Sum
> > > > > >>>>
> > > > > >>>> 10 10
> > > > > >>>>
> > > > > >>>>
> > > > > >>>> FirstName{~;<@(+ i.)/"1|:1 4 {"1 FirstNameDir
> > > > > >>>>
> > > > > >>>>
> > > > > >>>> Travis
> > > > > >>>>
> > > > > >>>> James
> > > > > >>>>
> > > > > >>>>
> > > > > > >>>> Note that that last expression is a bit complicated. It's not so bad,
> > > > > > >>>> though, if what you are extracting is a small part of the total. And, in
> > > > > > >>>> that case, using a list of indices to express a boolean result seems like a
> > > > > > >>>> good thing. You wind up working with set operations (intersection and
> > > > > > >>>> union) rather than logical operations (and and or). Also, set difference
> > > > > > >>>> instead of logical not (dyadic -. instead of monadic -.).
> > > > > >>>>
> > > > > >>>>
> > > > > >>>> intersect=: [ -. -.
> > > > > >>>>
> > > > > > >>>> union=: ~.@,
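
As a small illustration of combining index lists this way, here is a sketch using the example table's values. The names smith and ten are hypothetical; ten is what I. 10 = Sum would give for the Sum column above:

```j
intersect=: [ -. -.
union=: ~.@,
smith=: 1 4               NB. rows with last name Smith
ten=: 1 4 5               NB. rows whose sum is 10
both=: smith intersect ten   NB. Smith AND sum 10
either=: smith union ten     NB. Smith OR sum 10
notsmith=: ten -. smith      NB. sum 10 but NOT Smith
```

For this data both is 1 4, either is 1 4 5, and notsmith is the single row 5.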
> > > > > >>>>
> > > > > >>>>
> > > > > > >>>> (It looks like I might be using this kind of thing really soon, so I
> > > > > > >>>> thought I'd lay down my thoughts here and invite comment.)
> > > > > >>>>
> > > > > >>>>
> > > > > >>>> Thanks,
> > > > > >>>>
> > > > > >>>>
> > > > > >>>> --
> > > > > >>>>
> > > > > >>>> Raul
> > > > > >>>>
> > > > > > >>>> ----------------------------------------------------------------------
> > > > > > >>>> For information about J forums see http://www.jsoftware.com/forums.htm
> > > > > >>>>
> > > > > >>>
> > > >
> ----------------------------------------------------------------------
> > > > > >>> For information about J forums see
> > > > http://www.jsoftware.com/forums.htm
> > > > > >>>
> > > > > >>
> > > ----------------------------------------------------------------------
> > > > > >> For information about J forums see
> > > > http://www.jsoftware.com/forums.htm
> > > > > >>
> > > > > >
> > > ----------------------------------------------------------------------
> > > > > > For information about J forums see
> > > http://www.jsoftware.com/forums.htm
> > > > >
> > ----------------------------------------------------------------------
> > > > > For information about J forums see
> > http://www.jsoftware.com/forums.htm
> > > > >
> > > >
> ----------------------------------------------------------------------
> > > > For information about J forums see
> http://www.jsoftware.com/forums.htm
> > > >
> > > ----------------------------------------------------------------------
> > > For information about J forums see http://www.jsoftware.com/forums.htm
> > >
> > ----------------------------------------------------------------------
> > For information about J forums see http://www.jsoftware.com/forums.htm
> >
> > ----------------------------------------------------------------------
> > For information about J forums see http://www.jsoftware.com/forums.htm
> >
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm
>
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm
>
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm