[ngram] Re: Ngrams without line break

mercevg Wed, 01 Jul 2009 11:32:30 -0700

Dear Ted,

In my case, I would like to get all the ngrams except those that cross over the 
end of line. In your example:


the cat is
my friend the
cat is my friend

I don't want to get the ngrams "is my" and "the cat" when they have a new line
in the middle.

As you said, by default count.pl simply ignores end of line markers. But is it
possible not to ignore them?
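
To make the behaviour I am after concrete, here is a minimal sketch in Python. It is not count.pl and does not reproduce its tokenization or output format; the function name line_bounded_ngrams is hypothetical, and tokens are assumed to be whitespace-separated. Each line is treated as its own unit, so no ngram can span a line break.

```python
from collections import Counter


def line_bounded_ngrams(text, n=2):
    """Count n-grams line by line, so that none crosses a line break.

    Hypothetical helper for illustration only; tokens are split on whitespace.
    """
    counts = Counter()
    for line in text.splitlines():
        tokens = line.split()
        # Slide a window of size n over this line only.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


sample = "the cat is\nmy friend the\ncat is my friend\n"
counts = line_bounded_ngrams(sample, n=2)

for ngram, freq in counts.most_common():
    print(" ".join(ngram), freq)
```

On the three-line example, this keeps the "is my" inside the third line but drops the "is my" and "the cat" that would be formed across line ends, giving 7 bigrams instead of the 9 obtained when line ends are ignored.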

Thanks a lot!
Mercè

--- In ngram@yahoogroups.com, Ted Pedersen <duluth...@...> wrote:
>
> Greetings Merce,
> 
> To make sure I understand correctly, it sounds like you *only* want to
> see those ngrams that contain a line break. For example, if you run
> count.pl as follows on your test file
> 
> first line of text
> second line
> And a third line of text
> 
> count.pl test.out test
> 
> talisker(8): more test.out
> 11
> line<>of<>2 3 2
> of<>text<>2 2 2
> line<>And<>1 3 1
> And<>a<>1 1 1
> a<>third<>1 1 1
> second<>line<>1 1 3
> third<>line<>1 1 3
> first<>line<>1 1 3
> text<>second<>1 1 1
> 
> You will get the bigrams that cross over the end of line - (text
> second, line And), but you also get all the other ngrams too...and so
> it sounds to me like you only want the ones that cross over the new
> line markers, and nothing else. Is that accurate?
> 
> By default count.pl simply ignores end of line markers (the behavior
> you see above). So, it's not so much that the ngram includes the new
> line, it simply ignores it. So with a file like
> 
> the cat is
> my friend the
> cat is my friend
> 
> the 2 occurrences of "the cat" would be considered identical, even
> though the second could be thought of as having a new line in the
> middle of it (but we essentially ignore that).
> 
> So...at the moment at least I'm not sure how to limit the output to
> only those ngrams that are made by crossing over a new line
> marker....But, let me make sure I am understanding things correctly
> (so do let me know if I'm wrong) and I'll give this a little more
> thought too.
> 
> Cordially,
> Ted
> 
> 
> On Wed, Jul 1, 2009 at 12:15 PM, mercevg<merc...@...> wrote:
> >
> >
> > Dear all,
> >
> > I would like to know if it's possible to get only the ngrams that do not
> > contain line breaks from the corpus. I'll try to explain clearly: if the
> > input text file is
> >
> > first line of text
> > second line
> > And a third line of text
> >
> > Then, with count.pl, we'll get two bigrams containing line breaks:
> >
> > text second
> > line And
> >
> > Or trigrams:
> > of text second
> > text second line
> > second line And
> >
> > And so on.
> >
> > Taking into account these outputs, and after reading the help text, I don't
> > know if I can change the default count.pl options to get all ngrams from the
> > corpus except the ngrams containing words placed at the end of one sentence
> > and words that are at the beginning of the next sentence. That is, ngrams
> > that do not contain line breaks.
> >
> > Best wishes,
> > Mercè
> >
> > 
> 
> 
> 
> -- 
> Ted Pedersen
> http://www.d.umn.edu/~tpederse
>
