Dear all,

I would like to know whether it is possible to get only those ngrams that do 
not span line breaks in the corpus. To explain more clearly, suppose the input 
text file is:

first line of text
 second line
 And a third line of text

Then, with count.pl, we get two bigrams that span line breaks:

text second
line And

Or trigrams:
of text second
text second line
second line And

And so on.

Given these outputs, and after reading the help text, I don't know whether I 
can change the default count.pl options to get all ngrams from the corpus 
except those that combine words at the end of one line with words at the 
beginning of the next line. That is, only ngrams that do not span line breaks.
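
To make the desired behaviour concrete, here is a minimal Python sketch. It is 
not count.pl and says nothing about how count.pl works; the function name 
ngrams_within_lines and the simple whitespace tokenisation are assumptions made 
only for illustration. It builds ngrams one line at a time, so no ngram can 
cross a line break.

```python
def ngrams_within_lines(text, n):
    """Return n-grams (as tuples of tokens) that never cross a line break."""
    result = []
    for line in text.splitlines():
        # Assumption: tokens are separated by whitespace only.
        tokens = line.split()
        # Slide a window of size n over the tokens of this line alone.
        for i in range(len(tokens) - n + 1):
            result.append(tuple(tokens[i:i + n]))
    return result


# The sample input from this message.
sample = "first line of text\n second line\n And a third line of text\n"

bigrams = ngrams_within_lines(sample, 2)
trigrams = ngrams_within_lines(sample, 3)

for gram in bigrams:
    print(" ".join(gram))
```

With this approach, "text second" and "line And" are never produced, because 
each window is confined to a single line.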

Best wishes,
Mercè