Greetings Merce,

To make sure I understand correctly, it sounds like you *only* want to see those ngrams that contain a line break. For example, if you run count.pl as follows on your test file

first line of text
second line
And a third line of text

count.pl test.out test

talisker(8): more test.out
11
line<>of<>2 3 2
of<>text<>2 2 2
line<>And<>1 3 1
And<>a<>1 1 1
a<>third<>1 1 1
second<>line<>1 1 3
third<>line<>1 1 3
first<>line<>1 1 3
text<>second<>1 1 1

You will get the bigrams that cross over the end of a line (text second, line And), but you also get all the other ngrams too. So it sounds to me like you only want the ones that cross over the new line markers, and nothing else. Is that accurate?

By default count.pl simply ignores end of line markers (the behavior you see above). So it's not so much that the ngram includes the new line; count.pl simply ignores it. So with a file like

the cat is my friend the
cat is my friend

the 2 occurrences of "the cat" would be considered identical, even though the second could be thought of as having a new line in the middle of it (but we essentially ignore that).

So at the moment at least I'm not sure how to limit the output to only those ngrams that are made by crossing over a new line marker. But let me make sure I am understanding things correctly (so do let me know if I'm wrong) and I'll give this a little more thought too.

Cordially,
Ted

On Wed, Jul 1, 2009 at 12:15 PM, mercevg<merc...@yahoo.es> wrote:
>
> Dear all,
>
> I would like to know if it's possible to get ngrams without containing line
> breaks from the corpus. I'll try to explain clearly: if the input text file
> is
>
> first line of text
> second line
> And a third line of text
>
> Then, we'll get with count.pl two bigrams containing line breaks:
>
> text second
> line And
>
> Or trigrams:
> of text second
> text second line
> second line And
>
> And so on.
>
> Taking into account these outputs, and after reading help text, I don't know
> if I can change default count.pl options to get all ngrams from the corpus
> except the ngrams containing words placed at the end of one sentence and
> words that are at the beginning of the next sentence.
> That is, ngrams that do not contain line breaks.
>
> Best wishes,
> Mercè
>

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
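
To make the two readings of the question concrete, here is a minimal Python sketch that separates ngrams lying entirely within one line from ngrams that cross a line break. It is only an illustration and is not part of count.pl: the function names (tokenize, ngrams, split_ngrams) are hypothetical, and plain whitespace tokenization is an assumption that may differ from how count.pl tokenizes.

```python
from collections import Counter


def tokenize(line):
    # Assumption: whitespace tokenization, which may differ from count.pl.
    return line.split()


def ngrams(tokens, n):
    # All contiguous n-token windows, as tuples.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def split_ngrams(text, n=2):
    """Return (within, crossing) counts of n-grams.

    within:   occurrences whose tokens all come from a single line
    crossing: occurrences that span at least one line break
    """
    within = Counter()
    all_tokens = []
    for line in text.splitlines():
        tokens = tokenize(line)
        within.update(ngrams(tokens, n))
        all_tokens.extend(tokens)
    # Counting over the whole token stream ignores line breaks entirely.
    everything = Counter(ngrams(all_tokens, n))
    # Whatever is not accounted for within a line must cross a line break.
    crossing = everything - within
    return within, crossing


text = "first line of text\nsecond line\nAnd a third line of text\n"
within, crossing = split_ngrams(text, 2)
print(sorted(crossing))  # [('line', 'And'), ('text', 'second')]
print(sum(within.values()), sum(crossing.values()))  # 9 2
```

Here "within" corresponds to the ngrams that do not contain a line break, and "crossing" to the ones made by crossing over a new line marker. Because the split is done per occurrence, a file like the "the cat" example above would report one "the cat" in each group.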