I would just preprocess the file with Perl or Ruby:

perl -ne 'next unless m#/#; s#(.*)/(.*)#$1\t$2#; print;' infile > outfile

For the input line Arts/Animation/Anime/Clubs_and_Organizations, that would give you

Arts/Animation/Anime<TAB>Clubs_and_Organizations

i.e. two columns for every line, the parent path and the last path
component, split at the last slash (lines without slashes will be skipped).
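
If Ruby is more convenient, the same split could look roughly like the
sketch below. The method name split_parent_leaf and the sample lines are
made up for illustration; the only assumption is one category path per line.

```ruby
# Split a category path at its LAST slash into [parent, leaf].
# Returns nil when the path contains no slash, so such lines can be skipped.
# (split_parent_leaf is a hypothetical helper name, not from the original mail.)
def split_parent_leaf(line)
  parent, sep, leaf = line.chomp.rpartition("/")
  sep.empty? ? nil : [parent, leaf]
end

# A few sample lines taken from the quoted snippet.
sample = [
  "Arts",
  "Arts/Animation",
  "Arts/Animation/Anime/Clubs_and_Organizations",
]

sample.each do |line|
  pair = split_parent_leaf(line)
  puts pair.join("\t") if pair   # tab-separated: parent, then leaf
end
```

To run it over the real file, replace the sample array with
File.foreach("infile") and redirect the output to a file as before.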

Come to think of it, if your entire file is just 800k lines, I'd do the
entire thing with Perl.
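
As a rough illustration of doing the whole count in one script, here is a
minimal sketch in Ruby (the Perl version would have the same shape). It
assumes that each line should count toward every one of its ancestor paths
and toward itself; if a category should not count itself, subtract one or
stop the loop one level earlier. prefix_counts is a hypothetical name.

```ruby
# Count, for every full category prefix, how many input lines fall under it.
# Assumption: a line counts toward each of its ancestors AND toward itself.
# (prefix_counts is a hypothetical helper name, not from the original mail.)
def prefix_counts(lines)
  counts = Hash.new(0)
  lines.each do |line|
    parts = line.chomp.split("/")
    next if parts.empty?          # skip blank lines
    # Build "Arts", "Arts/Animation", ... and increment each prefix once.
    (1..parts.length).each do |depth|
      counts[parts.first(depth).join("/")] += 1
    end
  end
  counts
end

# A few sample lines taken from the quoted snippet.
sample = [
  "Arts",
  "Arts/Animation",
  "Arts/Animation/Anime",
  "Arts/Animation/Anime/Characters",
  "Arts/Animation/Anime/Creators",
]

prefix_counts(sample).each { |path, n| puts "#{path}\t#{n}" }
```

Because the hash is keyed on the full path, Arts/Animation/Anime is counted
separately from any other Anime that appears elsewhere in the tree. For the
real file, pass File.foreach("infile") instead of the sample array; 800k
short lines should fit in memory comfortably.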

HTH,

/David

On Fri, Oct 1, 2010 at 13:32, Rob Wilkerson <rwilker...@lotame.com> wrote:

> Hey guys -
>
> I have a script that loads a list of ~800,000 category hierarchies,
> filters them a bit and streams them through a PHP script for some
> quick procedural work. The file contains one column and a snippet
> looks like this:
>
> Arts
> Arts/Animation
> Arts/Animation/Anime
> Arts/Animation/Anime/Characters
> Arts/Animation/Anime/Clubs_and_Organizations
> Arts/Animation/Anime/Collectibles
> Arts/Animation/Anime/Collectibles/Cels
> Arts/Animation/Anime/Collectibles/Models_and_Figures
> Arts/Animation/Anime/Collectibles/Models_and_Figures/Action_Figures
> Arts/Animation/Anime/Collectibles/Models_and_Figures/Action_Figures/Gundam
> Arts/Animation/Anime/Collectibles/Models_and_Figures/Action_Figures/Zoids
> Arts/Animation/Anime/Collectibles/Models_and_Figures/Models
> Arts/Animation/Anime/Collectibles/Models_and_Figures/Models/Gundam
> Arts/Animation/Anime/Collectibles/Shitajiki
> Arts/Animation/Anime/Creators
> Arts/Animation/Anime/Creators/Anno,_Hideaki
> Arts/Animation/Anime/Creators/Ikuhara,_Kunihiko
> Arts/Animation/Anime/Creators/Miyazaki,_Hayao
> Arts/Animation/Anime/Creators/Studios
> Arts/Animation/Anime/Creators/Studios/Studio_Ghibli
> Arts/Animation/Anime/Creators/Studios/Studio_Ghibli/Titles
> Arts/Animation/Anime/Distribution
> Arts/Animation/Anime/Distribution/Companies
>
> Now I need to take it one step further. I need to get a count of how
> many items are in "Arts", how many are in "Arts/Animation", etc. I
> know a grouping and count is involved, but I can't wrap my mind around
> how to get there since the category path depth is entirely variable
> and I need these numbers relative to the "whole" (i.e. I need to know
> how many times Arts/Animation/Anime appears rather than how many times
> Anime appears at any level).
>
> Any guidance would be much appreciated.
>
> Rob Wilkerson


-- 
David Vrensk
Systems developer, ICE House AB
Mobile: +46 703 74 69 00
