Yes Skip, you got to the heart of the problem. The biggest overhead for me was moving data. You see, a typical data set for my specific MRP problem has around 30K records, mostly strings with only a couple of numeric columns. Reading this much data into J inflates my J session to 90MB of RAM even before I manipulate the data and convert this 2-dimensional SQL data into an inverted database (I love J because of this), which makes 2 copies of the data: the original 2-dimensional table and the new inverted tables.
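To make the "inverted database" idea concrete, here is a minimal sketch of turning row-oriented SQL results into one list per column. It is written in Python rather than J purely for illustration; the function name `invert`, the column names, and the sample rows are all hypothetical and do not come from the actual MRP data.

```python
from typing import Any


def invert(rows: list[tuple], names: list[str]) -> dict[str, list[Any]]:
    """Turn row-oriented records into one list per column (an inverted table)."""
    if not rows:
        # No data: return an empty column for every name.
        return {name: [] for name in names}
    # zip(*rows) transposes the rows, yielding one tuple per column.
    columns = zip(*rows)
    return {name: list(col) for name, col in zip(names, columns)}


# Hypothetical sample rows, shaped like a small SQL result set.
rows = [
    ("ORD-001", "M", 120),
    ("ORD-002", "L", 75),
    ("ORD-003", "M", 40),
]
table = invert(rows, ["order_id", "size", "qty"])
```

Note that while both `rows` and `table` are alive, the session holds two copies of the data, which is the memory cost described above.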
At first, I tried memory mapping. Bad idea: one process would lock the file, and before it finished, another process would fail because the first process hadn't released the lock yet. Sometimes opening the file would even fail ... much to the detriment of my hair. In my next iteration, I used 3!:1 and 3!:2 to send data over the socket interface. The code was cleaner, but there was definitely time spent in data transfer, so I used libzip to compress and decompress the transferred data.

But I was still transferring and loading a lot of data on the subprocesses, and this is when I got the idea of converting strings to numbers. If you remember, back in J version 5 I was bugging everyone about how to optimize converting strings to numbers. Well, as soon as I get the data coming in from MS-SQL, I convert the 2-dimensional data into a 2-dimensional array of indexes into a dictionary of strings (dates and times are converted to a number of the form YYYYMMDD.fractionalTime, and numbers are left unconverted). This change made everything work better: smaller memory footprint, faster data transfer, faster comparison. Of course there is a downside, which is the cost of converting from strings to numbers and vice versa.

Oh, btw, I'm fine with it being in the 2nd test set. I was just emphasizing that data is one of the biggest problems but is also very, very essential in distributed computing.

Thanks.

________________________________________
From: [email protected] [[email protected]] On Behalf Of Skip Cave [[email protected]]
Sent: Tuesday, February 16, 2010 10:51 PM
To: Chat forum
Subject: Re: [Jchat] Multiple cores

Alex Rufon wrote:
> I agree with Skip's idea but I would like to suggest including boxed arrays
> or boxed strings in the test data set.
>
> I work exclusively with heterogeneous boxed arrays coming in from SQL Server.
> I actually don't process pure numeric information. One sample computation is
> matching a list of order by size against a list of consumption by size.
Skip replies:

I think that if Alex's problem is split across multiple processors, his matching process will entail moving some data between the processors during execution. I was trying to avoid that issue in my pure numerical example. My thought was, if we test a process that doesn't require any data movement other than the initial distribution and final assembly, and then find that parallel execution of that process doesn't provide all that much efficiency gain, then it is unlikely that processes that do require data movement between processors would be executed more efficiently in parallel.

Alex's problem has the advantage of a practical usage, so it could be added in the test suite as a second test example of parallel processing. However, I would expect his problem to be less amenable to distributing the processor load than the pure in-place computational problem, due to the requirement to move data between processors during execution.

Skip Cave
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
