Hi Merce,

Ah, now I understand. Fortunately there is a simple answer, I think.

count.pl cattest.out cattest --newLine

will cause the end of line markers to be respected, so ngrams will NOT
cross over them.

talisker(56): more cattest.out
7
cat<>is<>2 2 2
my<>friend<>2 2 2
friend<>the<>1 1 1
is<>my<>1 1 1
the<>cat<>1 1 1

So, I believe the --newLine option will do exactly as you require!
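
In case it helps to see the idea spelled out, here is a minimal Python
sketch of what respecting line breaks means for bigram counting. It is only
an illustration under simplifying assumptions, not how count.pl is
implemented: tokens are split on whitespace, the marginal frequency columns
are left out, and the names count_bigrams and new_line are hypothetical.

```python
from collections import Counter


def count_bigrams(text, new_line=False):
    """Count adjacent word pairs in text.

    If new_line is True, each line is treated as its own token stream,
    so no bigram spans a line break (hypothetical stand-in for --newLine).
    """
    # Either one stream per line, or the whole text as a single stream.
    streams = text.splitlines() if new_line else [text]
    counts = Counter()
    for stream in streams:
        tokens = stream.split()  # simplifying assumption: whitespace tokens
        counts.update(zip(tokens, tokens[1:]))
    return counts


text = "the cat is\nmy friend the\ncat is my friend\n"
default = count_bigrams(text)                  # line breaks ignored
respected = count_bigrams(text, new_line=True)  # line breaks respected

for (w1, w2), n in respected.most_common():
    print(f"{w1}<>{w2}<>{n}")
```

On the three-line cat example, the sketch finds 7 bigrams when line breaks
are respected and 9 when they are ignored, and "the cat" drops from 2
occurrences to 1.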

Please let me know if there are any other questions or concerns.

Thanks!
Ted

On Wed, Jul 1, 2009 at 1:04 PM, mercevg<merc...@yahoo.es> wrote:
>
>
> Dear Ted,
>
> In my case, I would like to get all the ngrams except those that cross over
> the end of line. In your example:
>
> the cat is
> my friend the
> cat is my friend
>
> I don't want to get "is my" and "the cat" as ngrams when they have a new
> line in the middle.
>
> As you said, by default count.pl simply ignores end of line markers. But
> is it possible not to ignore end of line markers?
>
> Thanks a lot!
> Mercè
>
> --- In ngram@yahoogroups.com, Ted Pedersen <duluth...@...> wrote:
>>
>> Greetings Merce,
>>
>> To make sure I understand correctly, it sounds like you *only* want to
>> see those ngrams that contain a line break. For example, if you run
>> count.pl as follows on your test file
>>
>> first line of text
>> second line
>> And a third line of text
>>
>> count.pl test.out test
>>
>> talisker(8): more test.out
>> 11
>> line<>of<>2 3 2
>> of<>text<>2 2 2
>> line<>And<>1 3 1
>> And<>a<>1 1 1
>> a<>third<>1 1 1
>> second<>line<>1 1 3
>> third<>line<>1 1 3
>> first<>line<>1 1 3
>> text<>second<>1 1 1
>>
>> You will get the bigrams that cross over the end of line (text second,
>> line And), but you also get all the other ngrams too...and so
>> it sounds to me like you only want the ones that cross over the new
>> line markers, and nothing else. Is that accurate?
>>
>> By default count.pl simply ignores end of line markers (the behavior
>> you see above). So, it's not so much that the ngram includes the new
>> line, it simply ignores it. So with a file like
>>
>> the cat is
>> my friend the
>> cat is my friend
>>
>> the 2 occurrences of "the cat" would be considered identical, even
>> though the second could be thought of as having a new line in the
>> middle of it (but we essentially ignore that).
>>
>> So...at the moment at least I'm not sure how to limit the output to
>> only those ngrams that are made by crossing over a new line
>> marker....But, let me make sure I am understanding things correctly
>> (so do let me know if I'm wrong) and I'll give this a little more
>> thought too.
>>
>> Cordially,
>> Ted
>>
>>
>> On Wed, Jul 1, 2009 at 12:15 PM, mercevg<merc...@...> wrote:
>> >
>> >
>> > Dear all,
>> >
>> > I would like to know if it's possible to get ngrams that do not contain
>> > line breaks from the corpus. I'll try to explain clearly: if the input
>> > text file is
>> >
>> > first line of text
>> > second line
>> > And a third line of text
>> >
>> > Then count.pl gives us two bigrams containing line breaks:
>> >
>> > text second
>> > line And
>> >
>> > Or trigrams:
>> > of text second
>> > text second line
>> > second line And
>> >
>> > And so on.
>> >
>> > Taking into account these outputs, and after reading the help text, I
>> > don't know if I can change the default count.pl options to get all ngrams
>> > from the corpus except those that combine words at the end of one
>> > sentence with words at the beginning of the next sentence. That is,
>> > ngrams that do not contain line breaks.
>> >
>> > Best wishes,
>> > Mercè
>> >
>> >
>>
>>
>>
>> --
>> Ted Pedersen
>> http://www.d.umn.edu/~tpederse
>>
>
> 



-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse
