Hi Jacob,
As the preceding discussion has illustrated, there are obviously a
number of options, nearly all of which will work well for what you
describe. As Pete Meyer suggested above, the best language may be the
one you already know (I also echo his suggestion to test your code on
a small subset of data first). For my $0.02, I would use Perl, because
I'm already comfortable with it and there is a lot of
readily-modifiable code already freely available. Perl excels at
quickly extracting, reformatting, and reporting data - hence the
backronym, Practical Extraction and Report Language. The interpreter is
written in C, so it tends to be fast enough for this kind of text
processing and simple arithmetic.

I'm still learning Python, but its functionality seems comparable.
Also, as Nat Echols noted, it's more readable to others, and it seems
to be more fashionable than Perl just now, so it will perhaps be more
useful to you in future programming projects.
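
To give you a concrete starting point, here is a minimal sketch in
Python with numpy of the kind of thing you describe (averaging, sigmas,
rejection, sorting). The file name and column layout are just
placeholders, so adjust them to your data:

import numpy as np

# Assumed: a purely numeric, tab-delimited table named table.tsv
data = np.loadtxt("table.tsv", delimiter="\t")

means = data.mean(axis=0)     # per-column averages
sigmas = data.std(axis=0)     # per-column standard deviations

# Example rejection: drop rows more than 3 sigma from the mean in column 0
keep = np.abs(data[:, 0] - means[0]) < 3.0 * sigmas[0]
data = data[keep]

# Sort the surviving rows by column 0
data = data[data[:, 0].argsort()]

np.savetxt("filtered.tsv", data, delimiter="\t")

loadtxt reads the whole table into memory, which should be fine for a
few hundred MB on a modern machine; if not, the same statistics can be
accumulated reading the file line by line instead.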

Good luck!

Best,
Anna

On Wed, Sep 12, 2012 at 8:21 AM, Quentin Delettre <[email protected]> wrote:
> I agree with Pete. Moreover, Python doesn't have built-in statistical
> functions, but adding packages (numpy and scipy in this case) is very simple.
>
> Quentin
>
> On 12/09/2012 17:11, Pete Meyer wrote:
>
>> One thing to keep in mind is that there's usually a trade-off between
>> setup (writing and testing) and execution time.  For one-off data
>> processing, I'd focus on implementation speed rather than execution speed
>> (in other words, FORTRAN might not be ideal unless you're already fluent
>> with it).
>>
>> That said, I'd take a look at python, octave or R.  Python's relatively
>> easy to learn, and more flexible than octave/R; but it doesn't have the
>> built-in statistical functions that octave and R do.
>>
>> One other tip which you've probably already thought of - depending on your
>> runtimes (I don't think 100s MB of data is usually considered an enormous
>> amount, but it'll depend on what you're doing) it may be worth getting
>> things working on a small subset of the data first.
>>
>> Pete
>>
>> Jacob Keller wrote:
>>>
>>> Dear List,
>>>
>>> Since this probably comes up a lot in manipulation of PDB/reflection
>>> files
>>> and so on, I was curious what people thought would be the best language
>>> for
>>> the following: I have some huge (100s MB) tables of tab-delimited data on
>>> which I would like to do some math (averaging, sigmas, simple arithmetic,
>>> etc) as well as some sorting and rejecting. It can be done in Excel, but
>>> this is exceedingly slow even in 64-bit, so I am looking to do it through
>>> some scripting. Just as an example, a "sort" which takes >10 min in Excel
>>> takes ~10 sec max with the unix command sort (seems crazy, no?). Any
>>> suggestions?
>>>
>>> Thanks, and sorry for being off-topic,
>>>
>>> Jacob
>>>
>>
>>
>
