Dear Ted,

In my case, I would like to get all the ngrams except those that cross over the end of a line. In your example:

the cat is
my friend the
cat is my friend

I don't want to get the occurrences of "is my" and "the cat" that have a new line in the middle. As you said, by default count.pl simply ignores end of line markers. But is it possible not to ignore end of line markers?

Thanks a lot!
Mercè

--- In ngram@yahoogroups.com, Ted Pedersen <duluth...@...> wrote:
>
> Greetings Merce,
>
> To make sure I understand correctly, it sounds like you *only* want to
> see those ngrams that contain a line break. For example, if you run
> count.pl as follows on your test file
>
> first line of text
> second line
> And a third line of text
>
> count.pl test.out test
>
> talisker(8): more test.out
> 11
> line<>of<>2 3 2
> of<>text<>2 2 2
> line<>And<>1 3 1
> And<>a<>1 1 1
> a<>third<>1 1 1
> second<>line<>1 1 3
> third<>line<>1 1 3
> first<>line<>1 1 3
> text<>second<>1 1 1
>
> You will get the bigrams that cross over the end of line - (text
> second, line And), but you also get all the other ngrams too...and so
> it sounds to me like you only want the ones that cross over the new
> line markers, and nothing else. Is that accurate?
>
> By default count.pl simply ignores end of line markers (the behavior
> you see above). So, it's not so much that the ngram includes the new
> line, it simply ignores it. So with a file like
>
> the cat is
> my friend the
> cat is my friend
>
> the 2 occurrences of "the cat" would be considered identical, even
> though the second could be thought of as having a new line in the
> middle of it (but we essentially ignore that).
>
> So...at the moment at least I'm not sure how to limit the output to
> only those ngrams that are made by crossing over a new line
> marker....But, let me make sure I am understanding things correctly
> (so do let me know if I'm wrong) and I'll give this a little more
> thought too.
>
> Cordially,
> Ted
>
>
> On Wed, Jul 1, 2009 at 12:15 PM, mercevg<merc...@...> wrote:
> >
> >
> > Dear all,
> >
> > I would like to know if it's possible to get ngrams that do not contain line
> > breaks from the corpus. I'll try to explain clearly: if the input text file
> > is
> >
> > first line of text
> > second line
> > And a third line of text
> >
> > Then, we'll get with count.pl two bigrams containing line breaks:
> >
> > text second
> > line And
> >
> > Or trigrams:
> > of text second
> > text second line
> > second line And
> >
> > And so on.
> >
> > Taking into account these outputs, and after reading the help text, I don't know
> > if I can change the default count.pl options to get all ngrams from the corpus
> > except the ngrams containing words placed at the end of one sentence and
> > words that are at the beginning of the next sentence. That is, ngrams that do
> > not contain line breaks.
> >
> > Best wishes,
> > Mercè
> >
> >
> >
>
> --
> Ted Pedersen
> http://www.d.umn.edu/~tpederse
>