Hi Ying,

Thanks for clarifying. Is it intended then that the last merge.* file
be left behind (so that you could then use huge-delete.pl on the merge
file again in order to use a new value for --uremove and --remove)?

If that's the case, then suppose I wanted to use --uremove 50 and
--remove 10 on another run (using my left-behind merge.82 file as the
input). Could you show the sequence of commands I would need to run to
do this? This sounds like fairly useful functionality (if I'm
understanding everything correctly), but I'm not sure I totally see
how to do this...
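
My best guess is something like the sketch below. It assumes that
huge-delete.pl takes its options followed by an input file and then an
output file, which I have not confirmed, and the output file name is just
one I made up for illustration. It prints the command first and only runs
it if huge-delete.pl is on the PATH and the merge file exists:

```shell
# Sketch only: the argument order (options, input file, output file) is an
# assumption about huge-delete.pl, not something confirmed in this thread.
OUTPUT=${OUTPUT:-/home/ted/mycount}            # same directory as the original job
MERGED="$OUTPUT/merge.82"                      # unfiltered merge file left behind
FILTERED="$OUTPUT/huge-count.u50-r10.output"   # hypothetical name for the new result

# Store the command as the positional parameters so it can be
# printed for review and then run exactly as printed.
set -- huge-delete.pl --uremove 50 --remove 10 "$MERGED" "$FILTERED"
echo "$*"

# Run it only if the script is available and the merge file exists.
if command -v "$1" >/dev/null 2>&1 && [ -f "$MERGED" ]; then
    "$@"
fi
```

If that is roughly right, then repeating it with different --uremove and
--remove values (and a different output name) would never require
recounting the corpus.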

Thanks!
Ted

On Mon, May 3, 2010 at 10:24 AM, Ying Liu <liux0...@umn.edu> wrote:
> Hi Ted,
>
> huge-count.output contains the results after applying --uremove 100 and
> --remove 5. merge.82 contains the results without --uremove 100 and
> --remove 5. huge-count.output is the correct result.
>
> After I adjusted the huge-count.pl code, the results of the split files
> are combined first, and then huge-delete.pl is used to apply --uremove
> and --remove. In real experiments this is more convenient, because I can
> change --uremove and --remove after I get the final merged results.
>
> Thanks,
> Ying
>
>
> Ted Pedersen wrote:
>>
>> Hi Ying,
>>
>> I've run the following job twice now, and both times it has ended with
>> a spare merge file being left behind (and huge-count.output looking
>> like it might be incomplete). Can you take a look at this and see if
>> there's a problem either in huge-count.pl or in how I'm running it?
>>
>> Details below, and do let me know if I can provide you with any
>> additional information.
>>
>> Thanks!
>> Ted
>>
>> ---------------------------
>> Here's the job I've been running (this is a bash shell script I submit)
>>
>> export INPUT=/home/ted/Corpora/TDP-XIE
>> export SPLIT=8000000
>> export OUTPUT=/home/ted/mycount
>>
>> rm -fr $OUTPUT
>> mkdir $OUTPUT
>>
>> huge-count.pl --tokenlist --stop /home/ted/vector/stoplist --split $SPLIT \
>>     --uremove 100 --remove 5 --window 4 $OUTPUT $INPUT
>>
>> ------------------------------------
>>
>> And this is the output that is left behind...
>>
>> t...@maraca:~/myhugecount$ ls
>> huge-count.output  merge.82
>>
>> t...@maraca:~/myhugecount$ wc *
>>   5048833   15146497  154092132 huge-count.output
>>  33467287  100401859 1027141307 merge.82
>>  38516120  115548356 1181233439 total
>>
>> t...@maraca:~/myhugecount$ ls
>> huge-count.output  merge.82
>>
>> --------------------------------------
>> My input is just the XIE portion of the GigaWord data...
>>
>> t...@maraca:~/Corpora/TDP-XIE$ ls
>> xie199501.txt  xie199606.txt  xie199711.txt  xie199904.txt  xie200009.txt
>> xie199502.txt  xie199607.txt  xie199712.txt  xie199905.txt  xie200010.txt
>> xie199503.txt  xie199608.txt  xie199801.txt  xie199906.txt  xie200011.txt
>> xie199504.txt  xie199609.txt  xie199802.txt  xie199907.txt  xie200012.txt
>> xie199505.txt  xie199610.txt  xie199803.txt  xie199908.txt  xie200101.txt
>> xie199506.txt  xie199611.txt  xie199804.txt  xie199909.txt  xie200102.txt
>> xie199507.txt  xie199612.txt  xie199805.txt  xie199910.txt  xie200103.txt
>> xie199508.txt  xie199701.txt  xie199806.txt  xie199911.txt  xie200104.txt
>> xie199509.txt  xie199702.txt  xie199807.txt  xie199912.txt  xie200105.txt
>> xie199510.txt  xie199703.txt  xie199808.txt  xie200001.txt  xie200106.txt
>> xie199511.txt  xie199704.txt  xie199809.txt  xie200002.txt  xie200107.txt
>> xie199512.txt  xie199705.txt  xie199810.txt  xie200003.txt  xie200108.txt
>> xie199601.txt  xie199706.txt  xie199811.txt  xie200004.txt  xie200109.txt
>> xie199602.txt  xie199707.txt  xie199812.txt  xie200005.txt  xie200110.txt
>> xie199603.txt  xie199708.txt  xie199901.txt  xie200006.txt  xie200111.txt
>> xie199604.txt  xie199709.txt  xie199902.txt  xie200007.txt
>> xie199605.txt  xie199710.txt  xie199903.txt  xie200008.txt
>>
>> t...@maraca:~/Corpora/TDP-XIE$ wc *
>>     6202   1184507   7024023 xie199501.txt
>>     6010   1134793   6715677 xie199502.txt
>>     7400   1444049   8550899 xie199503.txt
>>     6524   1255501   7446336 xie199504.txt
>>     7402   1450923   8568172 xie199505.txt
>>     7170   1409804   8332601 xie199506.txt
>>     6927   1303696   7719045 xie199507.txt
>>     7275   1413761   8299012 xie199508.txt
>>     6876   1372918   8128653 xie199509.txt
>>     7286   1405115   8364952 xie199510.txt
>>     7450   1388922   8224930 xie199511.txt
>>     6672   1300072   7715936 xie199512.txt
>>     7188   1334015   7862867 xie199601.txt
>>     6696   1215131   7156749 xie199602.txt
>>     8033   1519423   9013013 xie199603.txt
>>     7788   1473207   8730995 xie199604.txt
>>     8129   1556154   9218247 xie199605.txt
>>     8032   1549764   9194048 xie199606.txt
>>     8514   1640081   9639336 xie199607.txt
>>     8004   1531682   9038373 xie199608.txt
>>     7828   1524511   9058350 xie199609.txt
>>     8060   1549633   9218619 xie199610.txt
>>     7734   1506007   8968417 xie199611.txt
>>     7452   1465136   8689718 xie199612.txt
>>     7794   1499399   8837231 xie199701.txt
>>     6928   1333855   7875190 xie199702.txt
>>     8525   1721433  10204831 xie199703.txt
>>     7840   1546811   9144221 xie199704.txt
>>     8240   1637605   9657260 xie199705.txt
>>     7603   1511550   8911029 xie199706.txt
>>     7658   1501173   8862111 xie199707.txt
>>     7825   1495545   8777642 xie199708.txt
>>     8025   1639203   9736899 xie199709.txt
>>     8334   1661903   9854493 xie199710.txt
>>     8509   1663965   9882142 xie199711.txt
>>     8282   1657365   9812754 xie199712.txt
>>     8053   1550193   9154341 xie199801.txt
>>     7857   1555125   9163722 xie199802.txt
>>     8848   1742820  10342905 xie199803.txt
>>     8148   1581470   9393017 xie199804.txt
>>     8965   1752483  10382755 xie199805.txt
>>     8882   1749295  10302189 xie199806.txt
>>     9051   1788730  10542299 xie199807.txt
>>     8467   1690532   9940580 xie199808.txt
>>     8284   1728637  10189171 xie199809.txt
>>     8539   1739594  10310588 xie199810.txt
>>     8649   1707732  10109928 xie199811.txt
>>     9727   1867443  10940369 xie199812.txt
>>     8203   1603090   9426368 xie199901.txt
>>     7350   1461607   8615055 xie199902.txt
>>     9442   1882454  11126748 xie199903.txt
>>     9039   1788982  10560225 xie199904.txt
>>     8839   1759592  10389989 xie199905.txt
>>     8653   1732687  10220635 xie199906.txt
>>     8698   1742722  10224635 xie199907.txt
>>     8978   1751058  10251998 xie199908.txt
>>     9175   1850108  10899941 xie199909.txt
>>     8854   1733381  10273283 xie199910.txt
>>     8679   1658967   9789022 xie199911.txt
>>     8788   1716177  10116139 xie199912.txt
>>     8516   1606427   9434389 xie200001.txt
>>     8051   1571315   9239155 xie200002.txt
>>     9717   1895946  11166496 xie200003.txt
>>     9196   1830029  10819900 xie200004.txt
>>     9392   1805885  10689714 xie200005.txt
>>     9434   1826577  10834233 xie200006.txt
>>     9100   1790377  10553950 xie200007.txt
>>     9267   1818165  10695151 xie200008.txt
>>     9571   1779519  10427577 xie200009.txt
>>     8864   1796484  10646671 xie200010.txt
>>     8841   1731864  10225318 xie200011.txt
>>     8007   1623146   9549503 xie200012.txt
>>     7880   1480773   8681644 xie200101.txt
>>     8235   1581014   9272288 xie200102.txt
>>     9643   1937289  11393670 xie200103.txt
>>     8859   1748990  10261162 xie200104.txt
>>     8924   1758391  10328510 xie200105.txt
>>     8620   1716646  10131853 xie200106.txt
>>     8581   1709051  10080829 xie200107.txt
>>     9882   1983160  11583688 xie200108.txt
>>     8867   1728337  10179257 xie200109.txt
>>     9437   1793786  10614699 xie200110.txt
>>     1740    339343   1995533 xie200111.txt
>>   679007 132786005 783905863 total
>>
>>
>
>



-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse
