There is no way to make these programs case insensitive. They
take the input pretty much "as is" and use that. So, I think you'd
need to lowercase your files before passing them in. You can
use --token to specify how you tokenize words (for example, whether don't is
treated as three tokens (don ' t) or one (don't)). --stop lets you exclude
words from being counted, but there isn't anything that lets you ignore case.
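
Since case folding has to happen before counting, one option is to write
lower-cased copies of the input files and count those instead. A minimal
sketch, assuming plain ASCII text; the function name lowercase_copy and the
file names in the usage line are made up for illustration, and tr
implementations that work byte by byte will leave accented or other non-ASCII
letters unchanged:

```shell
# lowercase_copy SRC DST
# Write a lower-cased copy of SRC to DST, leaving SRC untouched.
# (Hypothetical helper; not part of any package discussed here.)
lowercase_copy() {
    tr '[:upper:]' '[:lower:]' < "$1" > "$2"
}
```

For example, run "lowercase_copy input1 input1.lc" and then count input1.lc
rather than input1.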

On Tue, Apr 17, 2018 at 8:51 AM, Ted Pedersen wrote:

> Hi Catherine,
> Here are a few answers to your questions, hopefully.
> I don't think we'll be able to update this code anytime soon - we just
> don't have anyone available to work on that right now, unfortunately. That
> said we are very open to others making contributions, fixes, etc.
> The number of files you can pass on the command line depends on your
> system and operating system. On Linux you can adjust it by running
> ulimit -s 100000
> or
> ulimit -s unlimited
> This raises the stack size limit for the current shell. On Linux the space
> available for command line arguments is derived from that limit, so a
> larger stack limit lets your system handle more command line arguments.
> You need sudo access only if the value you ask for is above your hard
> limit; without it, raising the limit that far is not something you can do.
> As far as taking multiple outputs from and merging them
> with huge-merge, I think the answer is that's almost possible, but not
> quite. huge-merge is not expecting the bigram count that appears on the
> first line of output to be there, and seems to fail as a
> result. So you would need to remove that first line from your
> output before merging.
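
Removing that first line can be done with tail. A minimal sketch;
drop_first_line and the .nototal suffix in the usage line are hypothetical
names chosen for illustration, and it assumes the total really is the only
thing on line 1, as described above:

```shell
# drop_first_line SRC DST
# Copy SRC to DST without its first line (the total bigram count).
# (Hypothetical helper; not part of any package discussed here.)
drop_first_line() {
    tail -n +2 "$1" > "$2"
}
```

For example, "drop_first_line input1.out input1.out.nototal" leaves the
original output in place and writes the trimmed copy next to it.
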
> The commands below kind of break down what is happening within
> If you run this you can get an idea of the input and output
> expected by each stage...
> --tokenlist input1.out input1
> --tokenlist input2.out input2
> --keep input1.out
> --keep input2.out
> mkdir output-directory
> mv input1.out-sorted output-directory
> mv input2.out-sorted output-directory
> --keep output-directory
> I hope this helps. I realize it's not exactly a solution, but I hope it's
> helpful all the same. I'll go through your notes again and see if there are
> other issues to address...and of course if you try something and it does or
> doesn't work I'm very interested in hearing about that...
> Cordially,
> Ted
> On Tue, Apr 17, 2018 at 7:33 AM, Ted Pedersen wrote:
>> The good news is that our documentation is more reliable than my memory.
>> :) huge-count treats each file separately and so bigrams do not cross file
>> boundaries. Having verified that I'll get back to your original question..
>> Sorry about the diversion and the confusion that might have caused.
>> More soon,
>> Ted
>> On Mon, Apr 16, 2018 at 4:11 PM, Ted Pedersen wrote:
>>> Let me go back and revisit this again, I seem to have confused myself!
>>> More soon,
>>> Ted
>>> On Mon, Apr 16, 2018 at 12:55 PM, [ngram] wrote:
>>>> Did I misread the documentation then?
>>>> " doesn't consider bigrams at file boundaries. In other
>>>> words,
>>>> the result of and on the same data file will
>>>> differ if --newLine is not used, in that, runs
>>>> on multiple files separately and thus looses the track of the bigrams
>>>> on file boundaries. With --window not specified, there will be loss
>>>> of one bigram at each file boundary while its W bigrams with --window
>>>> W."
>>>> I thought that means bigrams won't cross from one file to the next?
>>>> If bigrams don't cross from one file to the next, then I just need to
>>>> run on smaller inputs, then combine. So if I break
>>>> @filenames into smaller subsets, then call on the
>>>> subsets, then call to combine the counts, I think that
>>>> should work.
>>>> I have a few more questions related to usage:
>>>>    - Do you know how many arguments are allowed for It
>>>>    would be good to know what size chunks I need to split my data into.
>>>>    Or if not, then how would I do a try catch block to catch the error
>>>>    "Argument list too long" from the IPC::System::Simple::system call?
>>>>    - Is there a case-insensitive way to count bigrams, or would I need
>>>>    to convert all the text to lowercase before calling
>>>>    - Would you consider modifying so that the user can
>>>>    specify the final output filename, instead of just automatically calling
>>>>    the output file complete-huge-count.output?
>>>> Thank you,
>>>> Catherine
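
The plan of splitting the file list into smaller subsets can be sketched as
follows. The operating system limits the total size in bytes of the arguments
(plus environment) passed to a command rather than their number, so the chunk
size of 500 in the usage line is only a guess to be tuned. split_file_list,
filelist.txt, and the chunk_ prefix are hypothetical names, and the sketch
assumes one file name per line:

```shell
# split_file_list LIST CHUNK_SIZE PREFIX
# Split LIST (one file name per line) into pieces of at most CHUNK_SIZE
# names each, written as PREFIXaa, PREFIXab, ... by split(1).
# (Hypothetical helper; not part of any package discussed here.)
split_file_list() {
    split -l "$2" "$1" "$3"
}
```

For example, "split_file_list filelist.txt 500 chunk_" produces chunk_aa,
chunk_ab, and so on; each piece can then be handed to a separate counting run
and the per-chunk results combined afterwards.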
