I am trying to profile this code, so here is what I have so far. I added
the following code to the path taken in single-process mode.
I didn't bother with the multi-process one since I didn't know how to deal
with @profile and remotecall_wait.
@profile processOneFile(3085, 35649, filename)
bt, lidict = Profile.retrieve()
println("Profiling done")
for (k, v) in lidict
    println(v)
end
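For the multi-process path, maybe something like this would work (an
untested sketch; profile_on_worker is just a name I made up): wrap both
@profile and Profile.retrieve() in a function that runs entirely on the
worker, and use remotecall_fetch (instead of remotecall_wait) to bring the
result back, since each process keeps its own profile buffer.
@everywhere function profile_on_worker(a, b, filename)
    @profile processOneFile(a, b, filename)
    return Profile.retrieve()  # (backtrace data, line-info dict) from this worker
end
# run on worker 2 and pull its profile data back to the master process
bt, lidict = remotecall_fetch(2, profile_on_worker, 3085, 35649, filename)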
Output is here:
https://github.com/harikb/scratchpad1/blob/master/julia2/run1.txt (run
with julia 0.3.7), and another run is here:
https://github.com/harikb/scratchpad1/blob/master/julia2/run2.txt (run
with julia-debug 0.3.7), in case it gives better results.
However, quite a few lines are marked without line or file info.
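I suppose I could also look at Profile.print(), which (if I read the docs
right; I haven't dug into it much) prints an indented call tree or a flat
listing with sample counts, and might make it easier to see where the
unattributed frames sit:
Profile.print()                # call-tree view of the current process's samples
Profile.print(format = :flat)  # flat view, one line per (file, line) pair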
On Wednesday, April 22, 2015 at 2:44:13 AM UTC-7, Yuuki Soho wrote:
If I understand correctly now you are doing only 5 requests at the same
time? It seems to me you could do much more.
But that would only hide the inefficiency, at whatever level it exists. The
Go program also uses only 5 parallel connections.
On Wednesday, April 22, 2015 at 1:15:20 PM UTC-7, Stefan Karpinski wrote:
Honestly, I'm pretty pleased with that performance. This kind of thing
is Go's bread and butter – being within a factor of 2 of Go at something
like this is really good. That said, if you do figure out anything that's a
bottleneck here, please file issues – there's no fundamental reason Julia
can't be just as fast or faster than any other language at this.
Stefan, yes, it is about 2x if I subtract the roughly 10 seconds (as best I
can tell) of startup time. I am running Julia 0.3.7 on a box (RHEL) with a
deprecated GnuTLS. The deprecation warning comes about 8 seconds into the
run, and I wait another 2 seconds before I see the first print statement
from my code (the "Started N processes" message). My calculations already
exclude these 10 seconds.
I wonder if I would get better startup time with 0.4, but Requests.jl is
not compatible with it (nor have I found any other HTTP library that works
on 0.4). I will try 0.4 again and see if I can fix Requests.jl.
Any help on further analysis of the profile output is appreciated.
Thanks
--
Harry
On Thursday, April 23, 2015 at 7:21:11 AM UTC-7, Seth wrote:
>
> The use of Requests.jl makes this very hard to benchmark accurately since
> it introduces (non-measurable) dependencies on network resources.
>
> If you @profile the function, can you tell where it's spending most of its
> time?
>
> On Tuesday, April 21, 2015 at 2:12:52 PM UTC-7, Harry B wrote:
>>
>> Hello,
>>
>> I had the need to take a text file with several million lines, construct
>> a URL with parameters picked from the tab-delimited file, and fire them one
>> after the other. After I read about Julia, I decided to try this in Julia.
>> However, my initial implementation turned out to be slow and I was getting
>> close to my deadline. I then set the Julia implementation aside and wrote
>> the same thing in Go, my other favorite language. The Go version is at
>> least twice as fast as the Julia version. Now that the task/deadline is
>> over, I am coming back to the Julia version to see what I did wrong.
>>
>> The Go and Julia versions are not written alike. In Go, I have just one
>> main goroutine reading the file and 5 goroutines waiting on a channel; one
>> of them picks up the 'line/job', fires off the URL, waits for a response,
>> parses the JSON, looks for an id in a specific place, and goes back to
>> waiting for more items from the channel.
>>
>> The Julia code is very similar to the one discussed in the thread quoted
>> below. I invoke Julia with -p 5 and then have *each* process open the file
>> and read all lines. However, each process only processes 1/5th of the lines
>> and skips the others (a rough sketch of the pattern follows below). It is a
>> slight modification of what was discussed in this thread:
>> https://groups.google.com/d/msg/julia-users/Kr8vGwdXcJA/8ynOghlYaGgJ
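>>
>> A rough sketch of that pattern (hypothetical names, untested; the real
>> code is in the repo linked below) is:
>>
>> @everywhere function process_my_share(filename, nprocs, myindex)
>>     open(filename) do io
>>         for (i, line) in enumerate(eachline(io))
>>             # every process reads every line, but only handles its own slice
>>             i % nprocs == myindex - 1 || continue  # myindex runs from 1 to nprocs
>>             fields = split(chomp(line), '\t')
>>             # ... build the URL from `fields`, fire the GET, parse the JSON id ...
>>         end
>>     end
>> end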
>>
>> The Julia code (no server URL or source for that, though):
>> https://github.com/harikb/scratchpad1/tree/master/julia2
>> The server could be anything that returns a static JSON response.
>>
>> The files will fit entirely in the filesystem cache, and I am running this
>> on a fairly large system (procinfo says 24 cores, 100G RAM, 50G or so free
>> even after excluding cache). The input file is only 875K. This should
>> ideally mean I can read the files several times in any programming language
>> and not skip a beat. wc -l on the file takes only 0m0.002s. Any log/output
>> is written to a fusion-io based flash disk. All fairly high end.
>>
>> At this point, considering the machine is reasonably good, the only
>> bottleneck should be the time the URL requests take (it is a GET request,
>> but the other side has some processing to do) or the subsequent JSON
>> parsing.
>>
>> Where do I go from here? How do I find out whether HTTP connections are
>> being re-used by the underlying library? I am using
>> https://github.com/JuliaWeb/Requests.jl
>> If they are not, that could explain the difference. How do I profile this
>> code? I am using julia 0.3.7 (since Requests.jl does not work with 0.4
>> nightly).
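>>
>> One crude way I can think of to check the connection re-use question (just
>> an idea, not verified; example.com is a placeholder) is to time a few
>> back-to-back requests to the same host and see whether every request pays
>> the same setup cost as the first:
>>
>> using Requests
>> for i in 1:5
>>     t = @elapsed Requests.get("http://example.com/")  # placeholder URL
>>     println("request $i took $(t * 1000) ms")
>> end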
>>
>> Any help is appreciated.
>> Thanks
>> --
>> Harry
>>
>>