Hi Ying,

Thanks for clarifying. Is it intended then that the last merge.* file be left behind (so that you could then use huge-delete.pl on the merge file again in order to use a new value for --uremove and --remove)?

If that's the case, then suppose I wanted to use --uremove 50 and --remove 10 on another run (using my left-behind merge.82 file as the input). Could you show the sequence of commands I would need to run to do this? This sounds like fairly useful functionality (if I'm understanding everything correctly), but I'm not sure I totally see how to do this...

Thanks!
Ted

On Mon, May 3, 2010 at 10:24 AM, Ying Liu <liux0...@umn.edu> wrote:
> Hi Ted,
>
> huge-count.output contains the results after --uremove 100 and
> --remove 5. merge.82 contains the results without --uremove 100 and
> --remove 5. huge-count.output has the correct results.
>
> After I adjusted the huge-count.pl code, I first combine the results
> of the split files, and then use huge-delete.pl to apply --uremove and
> --remove. In real experiments this is more convenient, because I can
> change --uremove and --remove after I get the final merged results.
>
> Thanks,
> Ying
>
>
> Ted Pedersen wrote:
>>
>> Hi Ying,
>>
>> I've run the following job twice now, and both times it has ended with
>> a spare merge file being left behind (and huge-count.output looking
>> like it might be incomplete). Can you take a look at this and see if
>> there's a problem either in huge-count.pl or in how I'm running it?
>>
>> Details below, and do let me know if I can provide you with any
>> additional information.
>>
>> Thanks!
>> Ted
>>
>> ---------------------------
>> Here's the job I've been running (this is a bash shell script I submit)
>>
>> export INPUT=/home/ted/Corpora/TDP-XIE
>> export SPLIT=8000000
>> export OUTPUT=/home/ted/mycount
>>
>> rm -fr $OUTPUT
>> mkdir $OUTPUT
>>
>> huge-count.pl --tokenlist --stop /home/ted/vector/stoplist --split $SPLIT \
>>     --uremove 100 --remove 5 --window 4 $OUTPUT $INPUT
>>
>> ------------------------------------
>>
>> And this is the output that is left behind...
>>
>> t...@maraca:~/myhugecount$ ls
>> huge-count.output  merge.82
>>
>> t...@maraca:~/myhugecount$ wc *
>>   5048833  15146497  154092132 huge-count.output
>>  33467287 100401859 1027141307 merge.82
>>  38516120 115548356 1181233439 total
>>
>> t...@maraca:~/myhugecount$ ls
>> huge-count.output  merge.82
>>
>> --------------------------------------
>> My input is just the XIE portion of the GigaWord data...
>>
>> t...@maraca:~/Corpora/TDP-XIE$ ls
>> xie199501.txt  xie199606.txt  xie199711.txt  xie199904.txt  xie200009.txt
>> xie199502.txt  xie199607.txt  xie199712.txt  xie199905.txt  xie200010.txt
>> xie199503.txt  xie199608.txt  xie199801.txt  xie199906.txt  xie200011.txt
>> xie199504.txt  xie199609.txt  xie199802.txt  xie199907.txt  xie200012.txt
>> xie199505.txt  xie199610.txt  xie199803.txt  xie199908.txt  xie200101.txt
>> xie199506.txt  xie199611.txt  xie199804.txt  xie199909.txt  xie200102.txt
>> xie199507.txt  xie199612.txt  xie199805.txt  xie199910.txt  xie200103.txt
>> xie199508.txt  xie199701.txt  xie199806.txt  xie199911.txt  xie200104.txt
>> xie199509.txt  xie199702.txt  xie199807.txt  xie199912.txt  xie200105.txt
>> xie199510.txt  xie199703.txt  xie199808.txt  xie200001.txt  xie200106.txt
>> xie199511.txt  xie199704.txt  xie199809.txt  xie200002.txt  xie200107.txt
>> xie199512.txt  xie199705.txt  xie199810.txt  xie200003.txt  xie200108.txt
>> xie199601.txt  xie199706.txt  xie199811.txt  xie200004.txt  xie200109.txt
>> xie199602.txt  xie199707.txt  xie199812.txt  xie200005.txt  xie200110.txt
>> xie199603.txt  xie199708.txt  xie199901.txt  xie200006.txt  xie200111.txt
>> xie199604.txt  xie199709.txt  xie199902.txt  xie200007.txt
>> xie199605.txt  xie199710.txt  xie199903.txt  xie200008.txt
>>
>> t...@maraca:~/Corpora/TDP-XIE$ wc *
>>   6202   1184507   7024023 xie199501.txt
>>   6010   1134793   6715677 xie199502.txt
>>   7400   1444049   8550899 xie199503.txt
>>   6524   1255501   7446336 xie199504.txt
>>   7402   1450923   8568172 xie199505.txt
>>   7170   1409804   8332601 xie199506.txt
>>   6927   1303696   7719045 xie199507.txt
>>   7275   1413761   8299012 xie199508.txt
>>   6876   1372918   8128653 xie199509.txt
>>   7286   1405115   8364952 xie199510.txt
>>   7450   1388922   8224930 xie199511.txt
>>   6672   1300072   7715936 xie199512.txt
>>   7188   1334015   7862867 xie199601.txt
>>   6696   1215131   7156749 xie199602.txt
>>   8033   1519423   9013013 xie199603.txt
>>   7788   1473207   8730995 xie199604.txt
>>   8129   1556154   9218247 xie199605.txt
>>   8032   1549764   9194048 xie199606.txt
>>   8514   1640081   9639336 xie199607.txt
>>   8004   1531682   9038373 xie199608.txt
>>   7828   1524511   9058350 xie199609.txt
>>   8060   1549633   9218619 xie199610.txt
>>   7734   1506007   8968417 xie199611.txt
>>   7452   1465136   8689718 xie199612.txt
>>   7794   1499399   8837231 xie199701.txt
>>   6928   1333855   7875190 xie199702.txt
>>   8525   1721433  10204831 xie199703.txt
>>   7840   1546811   9144221 xie199704.txt
>>   8240   1637605   9657260 xie199705.txt
>>   7603   1511550   8911029 xie199706.txt
>>   7658   1501173   8862111 xie199707.txt
>>   7825   1495545   8777642 xie199708.txt
>>   8025   1639203   9736899 xie199709.txt
>>   8334   1661903   9854493 xie199710.txt
>>   8509   1663965   9882142 xie199711.txt
>>   8282   1657365   9812754 xie199712.txt
>>   8053   1550193   9154341 xie199801.txt
>>   7857   1555125   9163722 xie199802.txt
>>   8848   1742820  10342905 xie199803.txt
>>   8148   1581470   9393017 xie199804.txt
>>   8965   1752483  10382755 xie199805.txt
>>   8882   1749295  10302189 xie199806.txt
>>   9051   1788730  10542299 xie199807.txt
>>   8467   1690532   9940580 xie199808.txt
>>   8284   1728637  10189171 xie199809.txt
>>   8539   1739594  10310588 xie199810.txt
>>   8649   1707732  10109928 xie199811.txt
>>   9727   1867443  10940369 xie199812.txt
>>   8203   1603090   9426368 xie199901.txt
>>   7350   1461607   8615055 xie199902.txt
>>   9442   1882454  11126748 xie199903.txt
>>   9039   1788982  10560225 xie199904.txt
>>   8839   1759592  10389989 xie199905.txt
>>   8653   1732687  10220635 xie199906.txt
>>   8698   1742722  10224635 xie199907.txt
>>   8978   1751058  10251998 xie199908.txt
>>   9175   1850108  10899941 xie199909.txt
>>   8854   1733381  10273283 xie199910.txt
>>   8679   1658967   9789022 xie199911.txt
>>   8788   1716177  10116139 xie199912.txt
>>   8516   1606427   9434389 xie200001.txt
>>   8051   1571315   9239155 xie200002.txt
>>   9717   1895946  11166496 xie200003.txt
>>   9196   1830029  10819900 xie200004.txt
>>   9392   1805885  10689714 xie200005.txt
>>   9434   1826577  10834233 xie200006.txt
>>   9100   1790377  10553950 xie200007.txt
>>   9267   1818165  10695151 xie200008.txt
>>   9571   1779519  10427577 xie200009.txt
>>   8864   1796484  10646671 xie200010.txt
>>   8841   1731864  10225318 xie200011.txt
>>   8007   1623146   9549503 xie200012.txt
>>   7880   1480773   8681644 xie200101.txt
>>   8235   1581014   9272288 xie200102.txt
>>   9643   1937289  11393670 xie200103.txt
>>   8859   1748990  10261162 xie200104.txt
>>   8924   1758391  10328510 xie200105.txt
>>   8620   1716646  10131853 xie200106.txt
>>   8581   1709051  10080829 xie200107.txt
>>   9882   1983160  11583688 xie200108.txt
>>   8867   1728337  10179257 xie200109.txt
>>   9437   1793786  10614699 xie200110.txt
>>   1740    339343   1995533 xie200111.txt
>> 679007 132786005 783905863 total
>>
>>
>
>

--
Ted Pedersen
http://www.d.umn.edu/~tpederse