Dear Ted,

If I understand correctly, I now have to use count.pl in this way:

count.pl output.txt input.txt cattest.out cattest --newLine

or maybe just

count.pl output.txt input.txt --newLine

Thanks a lot for your help.
Mercè

--- In ngram@yahoogroups.com, Ted Pedersen <duluth...@...> wrote:
>
> Hi Merce,
>
> Ah, now I understand. Fortunately there is a simple answer, I think.
>
> count.pl cattest.out cattest --newLine
>
> will cause the end of line markers to be respected, so ngrams will NOT
> cross over them.
>
> talisker(56): more cattest.out
> 7
> cat<>is<>2 2 2
> my<>friend<>2 2 2
> friend<>the<>1 1 1
> is<>my<>1 1 1
> the<>cat<>1 1 1
>
> So, I believe the --newLine option will do exactly as you require!
>
> Please let me know if there are any other questions or concerns.
>
> Thanks!
> Ted
>
> On Wed, Jul 1, 2009 at 1:04 PM, mercevg<merc...@...> wrote:
> >
> >
> > Dear Ted,
> >
> > In my case, I would like to get all the ngrams except those that cross
> > over the end of line. In your example:
> >
> > the cat is
> > my friend the
> > cat is my friend
> >
> > I don't want to get the ngrams "is my" and "the cat" that have a new
> > line in the middle.
> >
> > As you said, by default count.pl simply ignores end of line markers.
> > But is it possible not to ignore end of line markers?
> >
> > Thanks a lot!
> > Mercè
> >
> > --- In ngram@yahoogroups.com, Ted Pedersen <duluthted@> wrote:
> >>
> >> Greetings Merce,
> >>
> >> To make sure I understand correctly, it sounds like you *only* want to
> >> see those ngrams that contain a line break.
> >> For example, if you run
> >> count.pl as follows on your test file
> >>
> >> first line of text
> >> second line
> >> And a third line of text
> >>
> >> count.pl test.out test
> >>
> >> talisker(8): more test.out
> >> 11
> >> line<>of<>2 3 2
> >> of<>text<>2 2 2
> >> line<>And<>1 3 1
> >> And<>a<>1 1 1
> >> a<>third<>1 1 1
> >> second<>line<>1 1 3
> >> third<>line<>1 1 3
> >> first<>line<>1 1 3
> >> text<>second<>1 1 1
> >>
> >> You will get the bigrams that cross over the end of line ("text
> >> second" and "line And"), but you also get all the other ngrams too...and
> >> so it sounds to me like you only want the ones that cross over the new
> >> line markers, and nothing else. Is that accurate?
> >>
> >> By default count.pl simply ignores end of line markers (the behavior
> >> you see above). So, it's not so much that the ngram includes the new
> >> line, it simply ignores it. So with a file like
> >>
> >> the cat is
> >> my friend the
> >> cat is my friend
> >>
> >> the 2 occurrences of "the cat" would be considered identical, even
> >> though the second could be thought of as having a new line in the
> >> middle of it (but we essentially ignore that).
> >>
> >> So...at the moment at least I'm not sure how to limit the output to
> >> only those ngrams that are made by crossing over a new line
> >> marker....But, let me make sure I am understanding things correctly
> >> (so do let me know if I'm wrong) and I'll give this a little more
> >> thought too.
> >>
> >> Cordially,
> >> Ted
> >>
> >>
> >> On Wed, Jul 1, 2009 at 12:15 PM, mercevg<mercevg@> wrote:
> >> >
> >> >
> >> > Dear all,
> >> >
> >> > I would like to know if it's possible to get ngrams from the corpus
> >> > that do not contain line breaks.
> >> > I'll try to explain clearly: if the input text file is
> >> >
> >> > first line of text
> >> > second line
> >> > And a third line of text
> >> >
> >> > Then, with count.pl we'll get two bigrams containing line breaks:
> >> >
> >> > text second
> >> > line And
> >> >
> >> > Or trigrams:
> >> > of text second
> >> > text second line
> >> > second line And
> >> >
> >> > And so on.
> >> >
> >> > Taking into account these outputs, and after reading the help text, I
> >> > don't know if I can change the default count.pl options to get all
> >> > ngrams from the corpus except the ngrams containing words placed at
> >> > the end of one sentence and words that are at the beginning of the
> >> > next sentence. That is, ngrams that do not contain line breaks.
> >> >
> >> > Best wishes,
> >> > Mercè
> >> >
> >> >
> >>
> >>
> >>
> >> --
> >> Ted Pedersen
> >> http://www.d.umn.edu/~tpederse
> >>
> >
> >
> >
>
> --
> Ted Pedersen
> http://www.d.umn.edu/~tpederse
>
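
For readers following the thread, here is a rough sketch of the difference between the two behaviours discussed above: ignoring end of line markers (the default) versus not letting bigrams cross them (what --newLine is described as doing). It is plain Python written only for illustration and is not the code of count.pl; the function name count_bigrams and the parameter respect_newlines are made up for this example. Each printed row imitates the layout of the quoted listings, which appear to show the bigram, its frequency, the number of bigrams with the same first word, and the number with the same second word.

```python
from collections import Counter


def count_bigrams(text, respect_newlines=False):
    """Return (total, rows); each row is (w1, w2, n11, n1p, np1).

    Hypothetical helper for illustration, not part of count.pl.
    """
    if respect_newlines:
        # One token sequence per line, so no bigram spans a line break.
        sequences = [line.split() for line in text.splitlines()]
    else:
        # One token sequence for the whole text; line breaks are ignored.
        sequences = [text.split()]

    pairs = Counter()
    for tokens in sequences:
        pairs.update(zip(tokens, tokens[1:]))

    # Marginal counts: how often each word is the first / second element.
    first = Counter()
    second = Counter()
    for (w1, w2), n in pairs.items():
        first[w1] += n
        second[w2] += n

    total = sum(pairs.values())
    rows = [(w1, w2, n, first[w1], second[w2]) for (w1, w2), n in pairs.items()]
    # Most frequent first; ties broken alphabetically (a choice of this sketch).
    rows.sort(key=lambda r: (-r[2], r[0], r[1]))
    return total, rows


CAT = "the cat is\nmy friend the\ncat is my friend\n"

total, rows = count_bigrams(CAT, respect_newlines=True)
print(total)
for w1, w2, n11, n1p, np1 in rows:
    print(f"{w1}<>{w2}<>{n11} {n1p} {np1}")
```

Run on the cat example with respect_newlines=True, the sketch reports a total of 7 bigrams and the same five rows as the cattest.out listing quoted above. With respect_newlines=False the total becomes 9, because "is my" and "the cat" are then also counted where they straddle a line break.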