Greetings Merce,

To make sure I understand correctly, it sounds like you *only* want to
see those ngrams that contain a line break. For example, if you run
count.pl as follows on your test file

first line of text
second line
And a third line of text

count.pl test.out test

talisker(8): more test.out
11
line<>of<>2 3 2
of<>text<>2 2 2
line<>And<>1 3 1
And<>a<>1 1 1
a<>third<>1 1 1
second<>line<>1 1 3
third<>line<>1 1 3
first<>line<>1 1 3
text<>second<>1 1 1

You will get the bigrams that cross over the end of a line (text
second and line And), but you also get all the other ngrams too...and so
it sounds to me like you only want the ones that cross over the new
line markers, and nothing else. Is that accurate?

By default count.pl simply ignores end of line markers (the behavior
you see above). So it's not so much that the ngram includes the new
line; count.pl simply ignores it. So with a file like

the cat is
my friend the
cat is my friend

the 2 occurrences of "the cat" would be considered identical, even
though the second could be thought of as having a new line in the
middle of it (but we essentially ignore that).
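
To make that concrete, here is a rough sketch in Python (not count.pl
itself, and not how NSP is implemented). It assumes tokens are simply
whitespace-separated, which is simpler than count.pl's own token
definitions, and the function name bigram_counts is made up for
illustration. Because a newline is treated like any other whitespace,
both occurrences of "the cat" land in the same bucket:

```python
from collections import Counter

def bigram_counts(text):
    # Hypothetical helper: split() treats a newline like any other
    # whitespace, so the line structure disappears before counting.
    tokens = text.split()
    # Pair each token with its successor to form the bigrams.
    return Counter(zip(tokens, tokens[1:]))

text = "the cat is\nmy friend the\ncat is my friend\n"
counts = bigram_counts(text)
print(counts[("the", "cat")])  # both occurrences are counted together
```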

So...at the moment at least I'm not sure how to limit the output to
only those ngrams that are made by crossing over a new line
marker....But, let me make sure I am understanding things correctly
(so do let me know if I'm wrong) and I'll give this a little more
thought too.
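
In the meantime, one way to think about the distinction is to separate
the two kinds of bigrams outside of count.pl. The following is only a
sketch in Python under the assumption of whitespace-separated tokens;
the function name split_bigrams_by_line_crossing is hypothetical and
this is not an existing NSP feature. It keeps one count for bigrams
that fall within a single line and another for bigrams formed by the
last token of one line and the first token of the next:

```python
from collections import Counter

def split_bigrams_by_line_crossing(text):
    # Hypothetical helper, assuming whitespace-separated tokens.
    within, crossing = Counter(), Counter()
    prev = None  # last token of the most recent non-empty line
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines entirely
        if prev is not None:
            # This bigram spans the line break.
            crossing[(prev, tokens[0])] += 1
        # These bigrams sit entirely inside the current line.
        within.update(zip(tokens, tokens[1:]))
        prev = tokens[-1]
    return within, crossing

text = "first line of text\nsecond line\nAnd a third line of text\n"
within, crossing = split_bigrams_by_line_crossing(text)
print(sorted(crossing))
print(within[("line", "of")])
```

Note that the counts in each group are raw frequencies only; they do
not reproduce the marginal totals that count.pl prints next to each
bigram.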

Cordially,
Ted


On Wed, Jul 1, 2009 at 12:15 PM, mercevg<merc...@yahoo.es> wrote:
>
>
> Dear all,
>
> I would like to know if it's possible to get ngrams without containing line
> breaks from the corpus. I'll try to explain clearly: if the input text file
> is
>
> first line of text
> second line
> And a third line of text
>
> Then, we'll get with count.pl two bigrams containing line breaks:
>
> text second
> line And
>
> Or trigrams:
> of text second
> text second line
> second line And
>
> And so on.
>
> Taking into account these outputs, and after reading help text, I don't know
> if I can change default count.pl options to get all ngrams from the corpus
> except the ngrams containing words placed at the end of one sentence and
> words that are at the beginning of the next sentence. That is, ngram without
> containing line breaks.
>
> Best wishes,
> Mercè
>
> 



-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse