Conceptually speaking, at least two strings are unique in (almost) every row. Others are duplicated, yes.
Meanwhile, a significant part of this code also needs to run on 32-bit J (because j602 is apparently the only version of J which supports xml/sax). It's actually grinding through this data one row at a time (appending each to file as it completes). And I am sharing code between the two stages of processing, because tested code tends to be more robust than untested code.

Anyways, it's an interesting idea, and if I were really pressed for efficiency, I would try that. But the bulk of the cpu time is not here; it's several orders of magnitude larger in that xml parsing stage. (And I have been tempted to offload some of that to python or whatever, but there were some things that I was doing in J that I did not want to have to deal with in another language - that said, this use of nub was one of them, and since that is gone now maybe I should think again?)

Thanks,

--
Raul

On Tue, Aug 19, 2014 at 8:00 PM, Dan Bron <[email protected]> wrote:

> Is it possible to describe a significant amount of the strings occupying
> memory as coming from a "small" universe?
>
> In other words, are symbols (of the s: variety) an option for you? If you
> can describe your main table as a collection of symbols (in their integer
> form, ie 6 s: s:), numeric values, and foreign keys into other in-memory
> tables (ie the integers which dyad i. returns), you could express your
> entire table as numbers, which should provide a significant savings in
> space and time over a boxed representation.
>
> But that's a big re-engineering project. Even to get it back to the point
> where you have the same confidence in the numeric representation as you do
> in the current boxed implementation. (Plus, s: numbers have issues with
> transience).
>
> Please excuse typos; sent from a phone.
>
>
> On Aug 19, 2014, at 7:39 PM, Raul Miller <[email protected]> wrote:
> >
> > I updated the code in the live session and it's working much better now.
> >
> > Or at least, that part is.
> >
> > I'm also getting interface errors from 2!:0 and I am having to work around
> > that issue also. :/ (This issue, I think, represents kernel memory
> > fragmentation - I guess linux is not tuned for processes which hold huge
> > amounts of memory making system calls...)
> >
> > Thanks,
> >
> > --
> > Raul
> >
> >
> >
> >> On Tue, Aug 19, 2014 at 7:34 PM, Dan Bron <[email protected]> wrote:
> >>
> >>
> >> There is also integrated rank support (a specific category special code)
> >> for dyad -:"n , especially when n=1 (ie matching rows of tables has been
> >> made particularly efficient).
> >>
> >> That said, it's probably worth doing a few performance tests on
> >> medium-sized data sets to compare the performance of -:"1 to that of
> >> *./ . ~: rather than making a substitution on the blind and potentially
> >> wasting a 24 hour run (or more) on the larger, production inputs.
> >>
> >> -Dan
> >>
> >> Please excuse typos; sent from a phone.
> >>
> >>> On Aug 19, 2014, at 6:38 PM, Raul Miller <[email protected]> wrote:
> >>>
> >>> I'd want to see some detailed reference on this issue (~.!.0 on
> >>> non-numeric arrays) before I'd want to blow another day or longer
> >>> trying to reproduce the problem with that change.
> >>>
> >>> Alternatively, I'd want to get into the C implementation and find how
> >>> this could happen. That maybe should be done as a theoretical exercise
> >>> (understanding how the algorithm works and how it can fail) than as a
> >>> practical exercise.
> >>>
> >>> Please also keep in mind that I have not eliminated hardware flaws from
> >>> the plausible cause list. Memory corruption (or things equivalent to
> >>> memory corruption, such as an intermittently failing logic component)
> >>> is an all-too-likely possibility.
> >>>
> >>> Thanks,
> >>>
> >>> --
> >>> Raul
> >>>
> >>>
> >>>
> >>>> On Tue, Aug 19, 2014 at 5:15 PM, Henry Rich <[email protected]> wrote:
> >>>>
> >>>> ~.!.0 as I understand it uses a different algorithm from ~. even on
> >>>> nonnumerics, and might be worth trying.
> >>>>
> >>>> I am sure that ~.!.0 is much faster than ~. of floating-point arrays
> >>>> of rank > 1. I think ~. is OK when the rank is 1.
> >>>>
> >>>> Henry Rich
> >>>>
> >>>>
> >>>>> On 8/19/2014 2:11 PM, Raul Miller wrote:
> >>>>>
> >>>>> Please include the current time in the sequence of timestamps. The
> >>>>> code was still running at the point in time where I posted my email.
> >>>>>
> >>>>> That said, at this point, my attempt to interrupt succeeded, and I
> >>>>> have found the line of code which was stalled:
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
