I would just preprocess the file with Perl or Ruby:

    perl -ne 'next unless m#/#; s#(.*)/(.*)#$1\t$2#; print;' infile > outfile

That would give you

    Arts/Animation/Anime<TAB>Clubs_and_Organizations

i.e. two columns for every line (lines without slashes will be skipped).

Come to think of it, if your entire file is just 800k lines, I'd do the
entire thing with Perl.

HTH,
/David

On Fri, Oct 1, 2010 at 13:32, Rob Wilkerson <rwilker...@lotame.com> wrote:
> Hey guys -
>
> I have a script that loads a list of ~800,000 category hierarchies,
> filters them a bit and streams them through a PHP script for some
> quick procedural work. The file contains one column and a snippet
> looks like this:
>
> Arts
> Arts/Animation
> Arts/Animation/Anime
> Arts/Animation/Anime/Characters
> Arts/Animation/Anime/Clubs_and_Organizations
> Arts/Animation/Anime/Collectibles
> Arts/Animation/Anime/Collectibles/Cels
> Arts/Animation/Anime/Collectibles/Models_and_Figures
> Arts/Animation/Anime/Collectibles/Models_and_Figures/Action_Figures
> Arts/Animation/Anime/Collectibles/Models_and_Figures/Action_Figures/Gundam
> Arts/Animation/Anime/Collectibles/Models_and_Figures/Action_Figures/Zoids
> Arts/Animation/Anime/Collectibles/Models_and_Figures/Models
> Arts/Animation/Anime/Collectibles/Models_and_Figures/Models/Gundam
> Arts/Animation/Anime/Collectibles/Shitajiki
> Arts/Animation/Anime/Creators
> Arts/Animation/Anime/Creators/Anno,_Hideaki
> Arts/Animation/Anime/Creators/Ikuhara,_Kunihiko
> Arts/Animation/Anime/Creators/Miyazaki,_Hayao
> Arts/Animation/Anime/Creators/Studios
> Arts/Animation/Anime/Creators/Studios/Studio_Ghibli
> Arts/Animation/Anime/Creators/Studios/Studio_Ghibli/Titles
> Arts/Animation/Anime/Distribution
> Arts/Animation/Anime/Distribution/Companies
>
> Now I need to take it one step further. I need to get a count of how
> many items are in "Arts", how many are in "Arts/Animation", etc. I
> know a grouping and count is involved, but I can't wrap my mind around
> how to get there since the category path depth is entirely variable
> and I need these numbers relative to the "whole" (i.e. I need to know
> how many times Arts/Animation/Anime appears rather than how many times
> Anime appears at any level).
>
> Any guidance would be much appreciated.
>
> Rob Wilkerson

-- 
David Vrensk
Systems developer, ICE House AB