Yes Skip, you got to the heart of the problem. The biggest overhead for me was 
moving data. You see, a typical data set for my specific MRP problem entails 
around 30K records of strings and only a couple of numeric columns. Reading 
this much data into J inflates my J session to 90MB of RAM even before I 
manipulate the data. I then convert this 2-dimensional SQL data into an 
inverted database (I love J because of this), which makes 2 copies of the 
data: the original 2-dimensional table and the new inverted tables. 
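
As a rough illustration of the inversion step (written in Python rather than J, 
with made-up column names and rows that are not taken from the real MRP data), 
a row-oriented result set can be turned into one list per column like this:

```python
# Hypothetical rows as they might arrive from SQL: (order_no, size, qty).
rows = [
    ("SO-1001", "M", 120),
    ("SO-1001", "L", 80),
    ("SO-1002", "M", 45),
]
column_names = ("order_no", "size", "qty")


def invert(rows, names):
    """Turn a list of row tuples into a dict holding one list per column."""
    # zip(*rows) transposes rows into columns; an empty input yields empty columns.
    columns = list(zip(*rows)) if rows else [()] * len(names)
    return {name: list(col) for name, col in zip(names, columns)}


inverted = invert(rows, column_names)
print(inverted["size"])
```

Until `rows` is released, both layouts are held in memory at the same time, 
which is the doubling described above.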

At first I tried memory mapping. Bad idea: one process would lock the file, 
and before it had finished, another process would fail because the first 
process hadn't released it yet. Sometimes opening the file would even fail ... 
much to the detriment of my hair.

In my next iteration, I used 3!:1 and 3!:2 to send data over the socket 
interface. The code was cleaner, but there was definitely time spent in data 
transfer, so I used libzip to compress and decompress the transferred data.
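
The same serialize, compress, send pattern can be sketched in Python. This is 
only an analogy: `pickle` stands in for 3!:1 and 3!:2, `zlib` stands in for the 
compression library, and the 4-byte length prefix and the sample table are 
assumptions made for the example, not the framing or data actually used.

```python
import pickle
import struct
import zlib


def pack(obj):
    """Serialize, compress, and prefix with a 4-byte big-endian length."""
    payload = zlib.compress(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))
    return struct.pack(">I", len(payload)) + payload


def unpack(message):
    """Reverse of pack: read the length prefix, decompress, deserialize."""
    (length,) = struct.unpack(">I", message[:4])
    payload = message[4:4 + length]
    return pickle.loads(zlib.decompress(payload))


# Hypothetical inverted table with 30K rows of repetitive strings.
table = {"size": ["M", "L", "M"] * 10000, "qty": list(range(30000))}
message = pack(table)
raw_size = len(pickle.dumps(table, protocol=pickle.HIGHEST_PROTOCOL))
print(raw_size, len(message))
```

The bytes returned by `pack` are what would be written to the socket, and the 
receiving process would call `unpack` on them. Note that `pickle.loads` should 
only be used on data from a trusted peer.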

But I was still transferring and loading a lot of data in the subprocesses, 
and this is when I got the idea of converting strings to numbers. If you 
remember, during J version 5 I was bugging everyone about how to optimize 
converting strings to numbers. Well, as soon as I get the data coming in from 
MS-SQL, I convert the 2-dimensional data into a 2-dimensional array of 
indexes into a dictionary of strings (dates and times are converted to a 
number of the form YYYYMMDD.fractionalTime, and numbers are left unconverted). 
This change improved everything: smaller memory footprint, faster data 
transfer, faster comparison. Of course there is a downside, which is the cost 
of converting from strings to numbers and back. 
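
A minimal sketch of this encoding, again in Python rather than J. The function 
names are invented for the example, and reading "fractionalTime" as the 
fraction of the day that has elapsed is an assumption:

```python
from datetime import datetime


def encode_column(values):
    """Replace each string with its index into a dictionary of unique strings."""
    dictionary = []   # index -> string, in order of first appearance
    positions = {}    # string -> index, for fast lookup while encoding
    indexes = []
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indexes.append(positions[value])
    return dictionary, indexes


def decode_column(dictionary, indexes):
    """Recover the original strings from the dictionary and the indexes."""
    return [dictionary[i] for i in indexes]


def encode_timestamp(ts):
    """YYYYMMDD as the integer part, elapsed fraction of the day after the point."""
    seconds = ts.hour * 3600 + ts.minute * 60 + ts.second
    return ts.year * 10000 + ts.month * 100 + ts.day + seconds / 86400


sizes = ["M", "L", "M", "XL", "L"]
dictionary, indexes = encode_column(sizes)
stamp = encode_timestamp(datetime(2010, 2, 16, 12, 0, 0))
print(dictionary, indexes, stamp)
```

Once the columns are integers, comparisons between them are integer 
comparisons, and only the indexes plus one copy of each distinct string need 
to be transferred.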

Oh, btw, I'm fine with it being in the 2nd test set. I was just emphasizing 
that data is one of the biggest problems but is also very, very essential in 
distributed computing.

Thanks.

________________________________________
From: [email protected] [[email protected]] On Behalf Of Skip 
Cave [[email protected]]
Sent: Tuesday, February 16, 2010 10:51 PM
To: Chat forum
Subject: Re: [Jchat] Multiple cores

Alex Rufon wrote:
> I agree with Skip's idea but I would like to suggest including boxed arrays 
> or boxed strings in the test data set.
>
> I work exclusively with heterogeneous boxed arrays coming in from SQL Server. 
> I actually don't process pure numeric information. One sample computation is 
> matching a list of order by size against a list of consumption by size.
Skip replies:

I think that if Alex's problem is split across multiple processors, his
matching process will entail moving some data between the processors
during execution. I was trying to avoid that issue in my pure numerical
example. My thought was, if we test a process that doesn't require any
data movement other than the initial distribution and final assembly,
and then find that parallel execution of that process doesn't provide
all that much efficiency gain, then it is unlikely that processes that
do require data movement between processors would be executed more
efficiently in parallel.

Alex's problem has the advantage of a practical usage, so it could be
added in the test suite as a second test example of parallel processing.
However, I would expect his problem to be less amenable to distributing
the processor load than the pure in-place computational problem, due to
the requirement to move data between processors during execution.

Skip Cave
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm