Conceptually speaking, at least two strings are unique in (almost) every
row. Others are duplicated, yes.
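
For concreteness, here is a minimal sketch of the all-numeric encoding suggested in the quoted message below. The sample data and every name in it (names, syms, ids, back, lookup, keys, restored) are hypothetical and chosen only for illustration; this is not the production code.

```j
NB. Hypothetical sample column: a few strings drawn from a small universe.
names=: 'red';'green';'red';'blue';'green'

NB. Option 1: symbols. Monadic s: interns a boxed list of strings;
NB. 6 s: gives the integer form and 5 s: recovers the boxed strings.
syms=: s: names
ids=: 6 s: syms
back=: 5 s: syms

NB. Option 2: foreign keys into a lookup table, using dyadic i.
lookup=: ~. names          NB. distinct strings, in order of first appearance
keys=: lookup i. names     NB. one integer per row
restored=: keys { lookup   NB. decode back to boxed strings
```

Both back and restored should match names. The integers from 6 s: are only meaningful relative to the session's symbol table (probably part of the transience issue mentioned below), so the lookup-table keys may be the safer form to write to file.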

Meanwhile, a significant part of this code also needs to run on 32 bit J
(because j602 is apparently the only version of J which supports xml/sax).
It's actually grinding through this data one row at a time (appending each
to file as it completes). And I am sharing code between the two stages of
processing, because tested code tends to be more robust than untested code.

Anyways, it's an interesting idea, and if I were really pressed for
efficiency, I would try that. But the bulk of the cpu time is not here,
it's several orders of magnitude larger in that xml parsing stage. (And I
have been tempted to offload some of that to python or whatever, but there
were some things that I was doing in J that I did not want to have to deal
with in another language - that said, this use of nub was one of them, and
since that is gone now maybe I should think again?).

Thanks,

-- 
Raul


On Tue, Aug 19, 2014 at 8:00 PM, Dan Bron <[email protected]> wrote:

> Is it possible to describe a significant amount of the strings occupying
> memory as coming from a "small" universe?
>
> In other words, are symbols (of the s: variety) an option for you? If you
> can describe your main table as a collection of symbols (in their integer
> form, ie 6 s: s:), numeric values, and foreign keys into other in-memory
> tables (ie the integers which dyad i. returns), you could express your
> entire table as numbers, which should provide a significant savings in
> space and time over a boxed representation.
>
> But that's a big re-engineering project, even just to get back to the point
> where you have the same confidence in the numeric representation as you do
> in the current boxed implementation. (Plus, s: numbers have issues with
> transience).
>
> Please excuse typos; sent from a phone.
>
> > On Aug 19, 2014, at 7:39 PM, Raul Miller <[email protected]> wrote:
> >
> > I updated the code in the live session and it's working much better now.
> >
> > Or at least, that part is.
> >
> > I'm also getting interface errors from 2!:0 and I am having to work
> > around that issue also. :/ (This issue, I think, represents kernel memory
> > fragmentation - I guess linux is not tuned for processes which hold huge
> > amounts of memory making system calls...)
> >
> > Thanks,
> >
> > --
> > Raul
> >
> >
> >
> >> On Tue, Aug 19, 2014 at 7:34 PM, Dan Bron <[email protected]> wrote:
> >>
> >>
> >> There is also integrated rank support (a specific category of special
> >> code) for dyad -:"n , especially when n=1 (ie matching rows of tables has
> >> been made particularly efficient).
> >>
> >> That said, it's probably worth doing a few performance tests on
> >> medium-sized data sets to compare the performance of -:"1 to that of
> >> *./ . ~: rather than making a substitution blindly and potentially
> >> wasting a 24 hour run (or more) on the larger, production inputs.
> >>
> >> -Dan
> >>
> >> Please excuse typos; sent from a phone.
> >>
> >>> On Aug 19, 2014, at 6:38 PM, Raul Miller <[email protected]> wrote:
> >>>
> >>> I'd want to see some detailed reference on this issue (~.!.0 on
> >>> non-numeric arrays) before I'd want to blow another day or longer trying
> >>> to reproduce the problem with that change.
> >>>
> >>> Alternatively, I'd want to get into the C implementation and find how
> >>> this could happen. That maybe should be done as a theoretical exercise
> >>> (understanding how the algorithm works and how it can fail) rather than
> >>> as a practical exercise.
> >>>
> >>> Please also keep in mind that I have not eliminated hardware flaws from
> >>> the plausible cause list. Memory corruption (or things equivalent to
> >>> memory corruption, such as an intermittently failing logic component) is
> >>> an all-too-likely possibility.
> >>>
> >>> Thanks,
> >>>
> >>> --
> >>> Raul
> >>>
> >>>
> >>>
> >>>> On Tue, Aug 19, 2014 at 5:15 PM, Henry Rich <[email protected]> wrote:
> >>>>
> >>>> ~.!.0 as I understand it uses a different algorithm from ~. even on
> >>>> nonnumerics, and might be worth trying.
> >>>>
> >>>> I am sure that ~.!.0 is much faster than ~. of floating-point arrays
> >>>> of rank > 1.  I think ~. is OK when the rank is 1.
> >>>>
> >>>> Henry Rich
> >>>>
> >>>>
> >>>>> On 8/19/2014 2:11 PM, Raul Miller wrote:
> >>>>>
> >>>>> Please include the current time in the sequence of timestamps. The
> >>>>> code was still running at the point in time where I posted my email.
> >>>>>
> >>>>> That said, at this point, my attempt to interrupt succeeded, and I
> >>>>> have found the line of code which was stalled:
> >> ----------------------------------------------------------------------
> >> For information about J forums see http://www.jsoftware.com/forums.htm
