Hi Merce,

Ah, now I understand. Fortunately there is a simple answer, I think.

count.pl cattest.out cattest --newLine

will cause the end of line markers to be respected, so ngrams will NOT
cross over them.

talisker(56): more cattest.out
7
cat<>is<>2 2 2
my<>friend<>2 2 2
friend<>the<>1 1 1
is<>my<>1 1 1
the<>cat<>1 1 1

So, I believe the --newLine option will do exactly as you require!
Please let me know if there are any other questions or concerns.

Thanks!
Ted

On Wed, Jul 1, 2009 at 1:04 PM, mercevg<merc...@yahoo.es> wrote:
>
>
> Dear Ted,
>
> In my case, I would like to get all the ngrams except those that cross over
> the end of line. In your example:
>
> the cat is
> my friend the
> cat is my friend
>
> I don't want to get as ngrams "is my" and "the cat", those having a new line
> in the middle of them.
>
> As you said, by default count.pl simply ignores end of line markers. But
> is it possible not to ignore end of line markers?
>
> Thanks a lot!
> Mercè
>
> --- In ngram@yahoogroups.com, Ted Pedersen <duluth...@...> wrote:
>>
>> Greetings Merce,
>>
>> To make sure I understand correctly, it sounds like you *only* want to
>> see those ngrams that contain a line break. For example, if you run
>> count.pl as follows on your test file
>>
>> first line of text
>> second line
>> And a third line of text
>>
>> count.pl test.out test
>>
>> talisker(8): more test.out
>> 11
>> line<>of<>2 3 2
>> of<>text<>2 2 2
>> line<>And<>1 3 1
>> And<>a<>1 1 1
>> a<>third<>1 1 1
>> second<>line<>1 1 3
>> third<>line<>1 1 3
>> first<>line<>1 1 3
>> text<>second<>1 1 1
>>
>> You will get the bigrams that cross over the end of line (text
>> second, line And), but you also get all the other ngrams too...and so
>> it sounds to me like you only want the ones that cross over the new
>> line markers, and nothing else. Is that accurate?
>>
>> By default count.pl simply ignores end of line markers (the behavior
>> you see above). So, it's not so much that the ngram includes the new
>> line, it simply ignores it.
>> So with a file like
>>
>> the cat is
>> my friend the
>> cat is my friend
>>
>> the 2 occurrences of "the cat" would be considered identical, even
>> though the second could be thought of as having a new line in the
>> middle of it (but we essentially ignore that).
>>
>> So...at the moment at least I'm not sure how to limit the output to
>> only those ngrams that are made by crossing over a new line
>> marker....But, let me make sure I am understanding things correctly
>> (so do let me know if I'm wrong) and I'll give this a little more
>> thought too.
>>
>> Cordially,
>> Ted
>>
>>
>> On Wed, Jul 1, 2009 at 12:15 PM, mercevg<merc...@...> wrote:
>> >
>> >
>> > Dear all,
>> >
>> > I would like to know if it's possible to get ngrams that do not contain
>> > line breaks from the corpus. I'll try to explain clearly: if the input
>> > text file is
>> >
>> > first line of text
>> > second line
>> > And a third line of text
>> >
>> > Then, we'll get with count.pl two bigrams containing line breaks:
>> >
>> > text second
>> > line And
>> >
>> > Or trigrams:
>> > of text second
>> > text second line
>> > second line And
>> >
>> > And so on.
>> >
>> > Taking into account these outputs, and after reading the help text, I
>> > don't know if I can change the default count.pl options to get all ngrams
>> > from the corpus except the ngrams containing words placed at the end of
>> > one sentence and words that are at the beginning of the next sentence.
>> > That is, ngrams that do not contain line breaks.
>> >
>> > Best wishes,
>> > Mercè
>> >
>> >
>>
>>
>>
>> --
>> Ted Pedersen
>> http://www.d.umn.edu/~tpederse
>>
>

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
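
For readers who want to see the difference between the two behaviours discussed in this thread, here is a minimal Python sketch. It is not count.pl and does not use its code: the function name count_bigrams and the respect_newlines parameter are made up for illustration, whitespace splitting is a simplification of count.pl's tokenization, and only the joint bigram counts are computed (not the two marginal counts that count.pl also prints).

```python
from collections import Counter


def count_bigrams(text, respect_newlines=False):
    """Count adjacent word pairs in text (hypothetical helper, not count.pl).

    With respect_newlines=False the whole text is one token stream, so
    bigrams may span a line break (the default behaviour described above).
    With respect_newlines=True each line is counted separately, so no
    bigram crosses a line break (the behaviour attributed to --newLine).
    """
    if respect_newlines:
        units = [line.split() for line in text.splitlines()]
    else:
        units = [text.split()]

    counts = Counter()
    for tokens in units:
        # Pair every token with its successor inside the same unit.
        counts.update(zip(tokens, tokens[1:]))
    return counts


# The three-line "cat" example from the thread.
CAT_TEXT = "the cat is\nmy friend the\ncat is my friend\n"

default_counts = count_bigrams(CAT_TEXT)
per_line_counts = count_bigrams(CAT_TEXT, respect_newlines=True)

# Counter subtraction keeps only the positive differences, i.e. the
# bigram occurrences that exist only because a line break was ignored.
crossing = default_counts - per_line_counts

print(sum(default_counts.values()), sum(per_line_counts.values()))
print(sorted(crossing.items()))
```

On the cat example the per-line count yields 7 bigrams, the same total as in cattest.out above, while ignoring line breaks yields 9. The two extra occurrences are one "is my" and one "the cat", which are exactly the pairs that straddle a line break.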