Re: parsing and chunking large xyz files

Francis Avila Fri, 26 Dec 2014 11:41:16 -0800

If you need parallelism, you need to do an indexing pass first to determine 
the group boundaries. Then you can process them in parallel because you 
know the units-of-work.
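
To make the two passes concrete, here is a minimal sketch. It assumes the lines are already available as a vector (a plain in-memory vector stands in for the file here), and the names frame-index, frame-lines and sample-lines are made up for illustration. The first pass is sequential and records where each frame starts; the second pass folds over the index, so each frame is an independent unit of work.

```clojure
(require '[clojure.core.reducers :as r]
         '[clojure.string :as str])

(defn frame-index
  "Sequential pass over a vector of lines. Returns a vector of
  [start n-atoms] pairs, where start is the position of the frame's
  atom-count line. (Hypothetical helper, for illustration only.)"
  [lines]
  (loop [i 0, index []]
    (if (>= i (count lines))
      index
      (let [n (Long/parseLong (str/trim (nth lines i)))]
        ;; one frame = 1 count line + 1 comment line + n atom lines
        (recur (+ i 2 n) (conj index [i n]))))))

(defn frame-lines
  "The atom lines of the frame described by one index entry."
  [lines [start n]]
  (subvec lines (+ start 2) (+ start 2 n)))

;; Tiny stand-in for a file: two frames, with 2 atoms and 1 atom.
(def sample-lines
  ["2" "first frame" "C 0.0 0.0 0.0" "C 1.0 0.0 0.0"
   "1" "second frame" "O 0.0 0.0 1.0"])

(def sample-index (frame-index sample-lines))

;; With the boundaries known, the per-frame work can be folded.
;; Counting atoms is just a placeholder for real per-frame processing.
(def total-atoms
  (r/fold + (r/map #(count (frame-lines sample-lines %)) sample-index)))
```

Note that r/fold only splits the work when the collection is larger than its partition size (512 by default), so a toy index like this one is reduced on a single thread. The partition size can be passed as the first argument to r/fold.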


Iota is an ok fit for this, so I suggest trying it first. (You may have to 
dial down the parallelism of r/fold to avoid stressing your OS's mmap.)

Unfortunately this file format makes it impossible to determine chunk 
boundaries in parallel: there's no (easy) way to distinguish atom count and 
comment lines without knowing the state of previous lines (a comment line 
that contains only a number looks just like an atom count), so you cannot 
index in parallel. However, you can cache the index or even write it to disk 
for iota to read later.
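
Caching the index on disk could look roughly like the sketch below. It assumes the index is a plain Clojure vector of [start n-atoms] pairs, small enough to print as one EDN string. The names save-index!, load-index and cached-index are hypothetical, and build-fn stands for whatever function does the sequential indexing pass.

```clojure
(require '[clojure.edn :as edn]
         '[clojure.java.io :as io])

(defn save-index!
  "Writes the index to path as EDN text."
  [path index]
  (spit path (pr-str index)))

(defn load-index
  "Reads back an index previously written by save-index!."
  [path]
  (edn/read-string (slurp path)))

(defn cached-index
  "Returns the index stored at path if that file exists. Otherwise
  calls build-fn (the slow sequential pass), stores the result at
  path, and returns it."
  [path build-fn]
  (if (.exists (io/file path))
    (load-index path)
    (let [index (build-fn)]
      (save-index! path index)
      index)))
```

This does not check whether the data file changed after the index was written; a real version would probably want to store the file size or modification time alongside the index.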

I wrote a small gist to demonstrate the basic 
procedure: https://gist.github.com/favila/035718ab762c6adfc8dc


On Friday, December 26, 2014 10:00:57 AM UTC-6, cej38 wrote:
>
> Line-by-line is the problem.  I need groups of lines at a time.
>
>
>
>
> On Friday, December 26, 2014 10:33:27 AM UTC-5, Jony Hudson wrote:
>>
>> I think clojure.data.csv reads CSV files lazily, line-by-line, so might be 
>> useful to take a look at:
>>
>> https://github.com/clojure/data.csv
>>
>>
>> Jony
>>
>> On Friday, 26 December 2014 14:49:59 UTC, cej38 wrote:
>>>
>>> In molecular dynamics a popular format for writing out the positions of 
>>> the atoms in a system is the xyz file format (see: 
>>> http://en.wikipedia.org/wiki/XYZ_file_format and/or 
>>> http://www.ks.uiuc.edu/Research/vmd/plugins/molfile/xyzplugin.html). 
>>>  The format allows for storing the positions of the atoms at different 
>>> snapshots in time (aka "time step").  You may have a few to millions of 
>>> atoms in your system and you may have thousands of time steps represented 
>>> in the file.  It is easy to end up with a single file that is many GB in 
>>> size.  Here is a shell command that will create a very simple, and very 
>>> small, test file (note that the positions of the atoms are completely 
>>> unrealistic: they are all sitting on top of each other):
>>>
>>> perl -e 'open(F, ">>test1.xyz"); for ($t = 1; $t < 11; $t = $t + 1) {
>>>   print F "10\n\n";
>>>   for ($a = 1; $a < 11; $a = $a + 1) { print F "C  0.000 0.000 0.0000\n"; }
>>> }; close(F);'
>>>
>>>
>>> Here is a shell command that will produce a more complicated file 
>>> structure, with a different number of atoms in each time step. (Note that, 
>>> depending on who wrote the code that output the file, there may be other 
>>> columns of data at the end of each row, and the number of decimal places 
>>> kept and the type of spacing between elements may vary.)
>>>
>>> perl -e 'open(F, ">>test2.xyz"); for ($t = 1; $t < 5; $t = $t + 1) {
>>>   my $s = $t + 10; print F "$s \n";
>>>   my $color = substr("abcd efghij klmno pqrs tuv wxyz", int(rand(10)), int(rand(10)));
>>>   print F $color; print F "\n";
>>>   for ($a = 1; $a < (11 + $t); $a = $a + 1) { print F "C    10.000000   10.00000   10.00000   $a\n"; }
>>> }; close(F);'
>>>
>>> OK, that is the background to get to my question.  I need a way to parse 
>>> these files and group the lines into time steps.  I currently have 
>>> something that works, but only in cases where the file size is relatively 
>>> small, because it reads the whole file into memory.  I would like to use 
>>> something like iota that will allow me to lazily parse the file and run 
>>> reducers on the data.  Any help would be really appreciated.
>>>
>>>
>>>
>>>
>>>

-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
