On Aug 8, 2011 12:11 AM, "Ramprasad Prasad" <ramprasad...@gmail.com> wrote:
>
> Using the system Linux sort ... does not help.
> On my dual quad core machine (8 GB RAM), sort -n file takes 10
> minutes and in the end produces no output.
>

I had a smaller file and 32 GB to play with on a dual quad core (a DL320),
and sort still couldn't handle more than 2-4 gigs in one go.
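What got me past that was splitting the file, sorting the chunks
separately, and merging the sorted chunks at the end. A rough sketch of
the idea (GNU split and sort; the chunk size, the numeric key at the start
of each line, and the temp directory are assumptions you'd adjust for your
own data):

    # carve the big file into ~1M-line chunks named chunk.aa, chunk.ab, ...
    split -l 1000000 large-file chunk.

    # sort each chunk on its own; -T keeps sort's temp files on a
    # partition with room to spare
    for f in chunk.??; do
        sort -n -T /var/tmp "$f" -o "$f.sorted"
    done

    # -m merges the already-sorted chunks without re-sorting them
    sort -n -m -T /var/tmp chunk.??.sorted > large-file.sorted

    # clean up the intermediate chunks
    rm chunk.?? chunk.??.sorted

GNU sort will also do the chunking and merging on its own through temp
files if you hand it -T (and a bigger in-memory buffer with -S); the
manual version just makes each step visible and restartable.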
> when I put this data in mysql, there is an index on the order by
> field ... but I guess keys don't help when you are selecting the
> entire table.
>
> I guess there is a serious need for re-architecting, rather than
> creating such monstrous files, but when people work with legacy systems
> which worked fine when there was lower usage, and now you tell them they
> need an overhaul because the current system doesn't scale ... that
> takes a lot of convincing.

You're dealing with a similar issue to the one I had in this respect too.
The only difference is that I created my own issue out of ignorance,
having never dealt with that much data (the split/sort/merge above is what
got me through it). Well, with this data I just threw 30+ fields over a
hundred thousand lines (yes, you've still got more data to deal with than
that) into one table. That worked OK until my queries got a bit more
complex, at which point it took me 8+ hours to generate a report. I
rethought the tables (or, more honestly, read a bit and thought about what
the hell I was doing), created a half dozen relationships, and got the
report down to a little under 2 hours.

My advice is to think about rethinking your db. That will probably mean
rethinking the software too (or at least the queries it makes). You might
want to check out the #mysql freenode IRC channel - most of them are
pompous, but you'll get your answers. I think Perl is less related to your
issue, but the people in the #dbi and dbic Perl IRC channels are much more
easygoing about their business.

> On 8/8/11, Uri Guttman <u...@stemsystems.com> wrote:
> >>>>>> "RP" == Rajeev Prasad <rp.ne...@yahoo.com> writes:
> >
> > RP> hi, you can try this: first get only that field (sed/awk/perl)
> > RP> which you want to sort on into a file. sort that file, which i
> > RP> assume would be a lot less in size than your current file/table.
> > RP> then run a loop on the main file using the sorted file as a
> > RP> variable.
> >
> > RP> here is the logic in shell:
> >
> > RP> awk '{print $<field-to-be-sorted-on>}' <large-file> > tmp-file
> > RP> sort <tmp-file> > <sorted-temp-file>
> > RP> for id in `cat <sorted-temp-file>`; do
> > RP>     grep $id <large-file> >> sorted-large-file
> > RP> done
> >
> > have you thought about the time this will take? you are doing an
> > O( N**2 ) grep there. you are looping over all N keys and then
> > scanning the file's N lines for each key. that will take a very long
> > time for such a large file. as others have said, either use the sort
> > utility or do a merge/sort on the records. your way is effectively a
> > slow bubble sort!
> >
> > uri
> >
> > --
> > Uri Guttman  --  uri AT perlhunter DOT com  ---  http://www.perlhunter.com
> > ------------  Perl Developer Recruiting and Placement Services  -------------
> > -----  Perl Code Review, Architecture, Development, Training, Support  -------
>
> --
> Sent from my mobile device
>
> Thanks
> Ram
> <http://www.netcore.co.in/>
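To put Uri's point another way: the awk/grep loop quoted above rescans the
whole large file once per key, which is what makes it O( N**2 ). If all
you need is the file ordered by one column, a single sort call already
does the external merge for you (sketch only; the field number 3 and the
temp directory are placeholders, not taken from your data):

    # numeric sort on the 3rd whitespace-separated field, with temp
    # files sent to a partition that has space for them
    sort -k3,3n -T /var/tmp large-file > sorted-large-file

That is the "use the sort utility" route; the split/sort/merge sketch near
the top of this mail is the "merge/sort on the records" route.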