Hello,

I had a need to take a text file with several million lines, construct a 
URL with parameters picked from the tab-delimited file, and fire the 
requests one after the other. After reading about Julia, I decided to try 
this in Julia. However, my initial implementation turned out to be slow and 
I was getting close to my deadline, so I set the Julia implementation aside 
and wrote the same thing in Go, my other favorite language. The Go version 
is at least twice as fast as the Julia version. Now that the task/deadline 
is over, I am coming back to the Julia version to see what I did wrong.

The Go and Julia versions are not written alike. In Go, I have just one 
main goroutine reading the file and 5 goroutines waiting on a channel; one 
of them gets the 'line/job', fires off the URL, waits for a response, 
parses the JSON, looks for an id in a specific place, and goes back to 
waiting for more items from the channel. 

The Julia code is very similar to the one discussed in the thread quoted 
below. I invoke Julia with -p 5 and then have *each* process open the file 
and read all lines; however, each process only processes one fifth of the 
lines and skips the others. It is a slight modification of what was 
discussed in this thread: 
https://groups.google.com/d/msg/julia-users/Kr8vGwdXcJA/8ynOghlYaGgJ

Julia code (no server URL or source for that, though): 
https://github.com/harikb/scratchpad1/tree/master/julia2
The server could be anything that returns a static JSON.

The file will fit entirely in the filesystem cache, and I am running this 
on a fairly large system (procinfo says 24 cores, 100G RAM, 50G free even 
after subtracting cache). The input file is only 875K. This should ideally 
mean I can read the file several times in any programming language without 
skipping a beat; wc -l on the file takes only 0m0.002s. Any log/output is 
written to a fusion-io based flash disk. All fairly high end.


At this point, considering the machine is reasonably good, the only 
bottleneck should be the time each URL fetch takes (it is a GET request, 
but the other side has some processing to do) or the subsequent JSON 
parsing.

Where do I go from here? (a) How do I find out whether HTTP connections 
are being re-used by the underlying library? I am using 
https://github.com/JuliaWeb/Requests.jl
If they are not, that could explain the difference. (b) How do I profile 
this code? I am using Julia 0.3.7 (since Requests.jl does not work with 
the 0.4 nightly).

Any help is appreciated.
Thanks
--
Harry
