[ngram] Re: Ngrams without line break

mercevg Thu, 02 Jul 2009 04:39:48 -0700

Dear Ted,

If I understand correctly, I should now run count.pl like this:


count.pl output.txt input.txt cattest.out cattest --newLine

or maybe just

count.pl output.txt input.txt --newLine

Thanks a lot for your help. 

Mercè


--- In ngram@yahoogroups.com, Ted Pedersen <duluth...@...> wrote:
>
> Hi Merce,
> 
> Ah, now I understand. Fortunately there is a simple answer, I think.
> 
> count.pl cattest.out cattest --newLine
> 
> will cause the end of line markers to be respected, so ngrams will NOT
> cross over them.
> 
> talisker(56): more cattest.out
> 7
> cat<>is<>2 2 2
> my<>friend<>2 2 2
> friend<>the<>1 1 1
> is<>my<>1 1 1
> the<>cat<>1 1 1
> 
> So, I believe the --newLine option will do exactly as you require!
> 
> Please let me know if there are any other questions or concerns.
> 
> Thanks!
> Ted
> 
> On Wed, Jul 1, 2009 at 1:04 PM, mercevg<merc...@...> wrote:
> >
> >
> > Dear Ted,
> >
> > In my case, I would like to get all the ngrams except those that cross over
> > the end of line. In your example:
> >
> > the cat is
> > my friend the
> > cat is my friend
> >
> > I don't want to get "is my" and "the cat" as ngrams when they have a new
> > line in the middle.
> >
> > As you said, by default count.pl simply ignores end of line markers. But
> > is it possible not to ignore end of line markers?
> >
> > Thanks a lot!
> > Mercè
> >
> > --- In ngram@yahoogroups.com, Ted Pedersen <duluthted@> wrote:
> >>
> >> Greetings Merce,
> >>
> >> To make sure I understand correctly, it sounds like you *only* want to
> >> see those ngrams that contain a line break. For example, if you run
> >> count.pl as follows on your test file
> >>
> >> first line of text
> >> second line
> >> And a third line of text
> >>
> >> count.pl test.out test
> >>
> >> talisker(8): more test.out
> >> 11
> >> line<>of<>2 3 2
> >> of<>text<>2 2 2
> >> line<>And<>1 3 1
> >> And<>a<>1 1 1
> >> a<>third<>1 1 1
> >> second<>line<>1 1 3
> >> third<>line<>1 1 3
> >> first<>line<>1 1 3
> >> text<>second<>1 1 1
> >>
> >> You will get the bigrams that cross over the end of line (text second,
> >> line And), but you also get all the other ngrams too...and so
> >> it sounds to me like you only want the ones that cross over the new
> >> line markers, and nothing else. Is that accurate?
> >>
> >> By default count.pl simply ignores end of line markers (the behavior
> >> you see above). So, it's not so much that the ngram includes the new
> >> line, it simply ignores it. So with a file like
> >>
> >> the cat is
> >> my friend the
> >> cat is my friend
> >>
> >> the 2 occurrences of "the cat" would be considered identical, even
> >> though the second could be thought of as having a new line in the
> >> middle of it (but we essentially ignore that).
> >>
> >> So...at the moment at least I'm not sure how to limit the output to
> >> only those ngrams that are made by crossing over a new line
> >> marker....But, let me make sure I am understanding things correctly
> >> (so do let me know if I'm wrong) and I'll give this a little more
> >> thought too.
> >>
> >> Cordially,
> >> Ted
> >>
> >>
> >> On Wed, Jul 1, 2009 at 12:15 PM, mercevg<mercevg@> wrote:
> >> >
> >> >
> >> > Dear all,
> >> >
> >> > I would like to know if it's possible to get ngrams that do not contain
> >> > line breaks from the corpus. I'll try to explain clearly: if the input
> >> > text file is
> >> >
> >> > first line of text
> >> > second line
> >> > And a third line of text
> >> >
> >> > Then count.pl gives us two bigrams containing line breaks:
> >> >
> >> > text second
> >> > line And
> >> >
> >> > Or trigrams:
> >> > of text second
> >> > text second line
> >> > second line And
> >> >
> >> > And so on.
> >> >
> >> > Given these outputs, and after reading the help text, I don't know if I
> >> > can change the default count.pl options to get all ngrams from the corpus
> >> > except those containing words placed at the end of one sentence and words
> >> > at the beginning of the next sentence. That is, ngrams that do not contain
> >> > line breaks.
> >> >
> >> > Best wishes,
> >> > Mercè
> >> >
> >> >
> >>
> >>
> >>
> >> --
> >> Ted Pedersen
> >> http://www.d.umn.edu/~tpederse
> >>
> >
> > 
> 
> 
> 
> -- 
> Ted Pedersen
> http://www.d.umn.edu/~tpederse
>
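
For anyone following the thread, the difference between the default behaviour and --newLine can be sketched in a few lines of Python. This is only an illustration of the idea, not how count.pl is implemented: the function names and the `respect_newlines` parameter are made up for the example, and tokens are assumed to be plain whitespace-separated words (count.pl has its own token definitions). The three numbers after each bigram follow the layout of the listings in the thread: joint count, count of the first word in first position, count of the second word in second position.

```python
from collections import Counter


def count_bigrams(text, respect_newlines=False):
    """Count bigrams over whitespace tokens (an assumed, simplified tokenizer).

    With respect_newlines=True each line is processed on its own, so no
    bigram spans a line break. Otherwise line breaks are ignored.
    """
    if respect_newlines:
        units = [line.split() for line in text.splitlines()]
    else:
        units = [text.split()]

    joint, left, right = Counter(), Counter(), Counter()
    for tokens in units:
        for w1, w2 in zip(tokens, tokens[1:]):
            joint[(w1, w2)] += 1   # frequency of the bigram itself
            left[w1] += 1          # w1 seen as first word of any bigram
            right[w2] += 1         # w2 seen as second word of any bigram
    return sum(joint.values()), joint, left, right


def format_counts(total, joint, left, right):
    """Render counts in a 'w1<>w2<>joint left right' layout, total first."""
    lines = [str(total)]
    for (w1, w2), n in joint.most_common():
        lines.append(f"{w1}<>{w2}<>{n} {left[w1]} {right[w2]}")
    return "\n".join(lines)


CAT = "the cat is\nmy friend the\ncat is my friend\n"

if __name__ == "__main__":
    print(format_counts(*count_bigrams(CAT, respect_newlines=True)))
```

Run on the three-line cat example with `respect_newlines=True`, the sketch should give the same counts as the cattest.out listing quoted above, although bigrams with equal frequency may come out in a different order. With the default `respect_newlines=False`, "the cat" and "is my" are each counted twice, because the occurrences that span a line break are included.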
