On Thursday, July 5, 2012 4:51:46 AM UTC+5:30, (unknown) wrote:
> Dear Group,
>
> I am Sri Subhabrata Banerjee, writing from Gurgaon, India, to discuss some
> coding issues. If anyone in this learned group can shed some light on them,
> I would be grateful.
>
> I got to code a bunch of documents which are combined together.
> Like,
>
> 1) A Mumbai-bound aircraft with 99 passengers on board was struck by lightning
> on Tuesday evening that led to complete communication failure in mid-air and
> forced the pilot to make an emergency landing.
> 2) The discovery of a new sub-atomic particle that is key to understanding
> how the universe is built has an intrinsic Indian connection.
> 3) A bomb explosion outside a shopping mall here on Tuesday left no one
> injured, but Nigerian authorities put security agencies on high alert fearing
> more such attacks in the city.
>
> The task is to separate the documents on the fly and to parse each of the
> documents with a definite set of rules.
>
> Now, the way I am processing is:
> I am clubbing all the documents together, as,
>
> A Mumbai-bound aircraft with 99 passengers on board was struck by lightning
> on Tuesday evening that led to complete communication failure in mid-air and
> forced the pilot to make an emergency landing. The discovery of a new
> sub-atomic particle that is key to understanding how the universe is built
> has an intrinsic Indian connection. A bomb explosion outside a shopping mall
> here on Tuesday left no one injured, but Nigerian authorities put security
> agencies on high alert fearing more such attacks in the city.
>
> But they are separated by a tag set, like,
> A Mumbai-bound aircraft with 99 passengers on board was struck by lightning
> on Tuesday evening that led to complete communication failure in mid-air and
> forced the pilot to make an emergency landing.$
> The discovery of a new sub-atomic particle that is key to understanding how
> the universe is built has an intrinsic Indian connection.$
> A bomb explosion outside a shopping mall here on Tuesday left no one injured,
> but Nigerian authorities put security agencies on high alert fearing more
> such attacks in the city.
>
> To detect the document boundaries, I am splitting them into a bag of words
> and using a simple for loop as,
> for i in range(len(bag_words)):
>     if bag_words[i] == "$":
>         print(bag_words[i], i)
>
> There is no issue here. I am segmenting it nicely. I am using an annotated
> corpus, so I am applying parse rules.
>
> The confusion comes next,
>
> As per my problem statement, the size of the file (of documents combined
> together) won't increase on the fly. So, just to support all kinds of
> combinations, I am appending the "i" values to a list, taking its length,
> and using slices. It works perfectly. The question is: is there a smarter
> way to achieve this? And a curious question: if the documents arrive on the
> fly with no preprocessed tag set like "$", how may I do it? Separating a
> bunch of documents without an EOF marker, isn't it a classification problem?
>
> There is no question on parsing; it seems I am achieving it independent of
> the length of the document.
>
> Could anyone in the group comment on how I am dealing with the problem, and
> suggest which portions should be improved and how?
>
> Thanking You in Advance,
>
> Best Regards,
> Subhabrata Banerjee.
Thanks, Peter, but I feel your earlier one was better. I found an interesting one:
[i - 1 for i in range(len(f1)) if f1.startswith('$', i - 1)]
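For comparison, here is a minimal sketch of the same idea without the i - 1
offset. It assumes f1 holds the combined text as one string and that "$" never
occurs inside a document. The helper names find_delimiters and split_documents
and the sample text are made up for illustration only.

```python
def find_delimiters(text, tag="$"):
    """Return the index of every occurrence of a single-character tag."""
    return [i for i, ch in enumerate(text) if ch == tag]


def split_documents(text, tag="$"):
    """Split text on tag, trimming whitespace and dropping empty pieces."""
    return [doc.strip() for doc in text.split(tag) if doc.strip()]


# Hypothetical sample standing in for the combined file contents.
f1 = "First document.$\nSecond document.$\nThird document."

positions = find_delimiters(f1)   # where each "$" sits in the string
documents = split_documents(f1)   # the documents themselves, as a list

print(positions)
print(documents[1])
```

If only the documents are needed and not the positions, str.split alone may be
enough, since it removes the need to collect indices and slice by hand.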
But I am a bit intrigued by another question.
Suppose I say:
file_open=open("/python32/doc1.txt","r")
file=file_open.read().lower()
for line in file:
    line_word=line.split()
This works fine, but if I print, everything is printed continuously.
I would like to store the lines in some variable, so that I may print a line
of my choice and manipulate them as I choose.
Is there any way around this problem?
Regards,
Subhabrata Banerjee
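
One possible way to keep the lines around, as a sketch rather than a definitive
answer: read() returns the whole file as a single string, and looping over a
string yields one character at a time, whereas looping over the file object
itself yields one line at a time. Collecting the split lines in a list lets any
line be picked by index later. The function name read_lines_as_words is
hypothetical, and io.StringIO stands in for the real file so that the example
runs without /python32/doc1.txt.

```python
import io


def read_lines_as_words(handle):
    """Return one list of lowercased words per line read from handle."""
    return [line.lower().split() for line in handle]


# Stand-in for: open("/python32/doc1.txt", "r")
sample = io.StringIO("First Line here\nSecond line\nThird\n")

lines = read_lines_as_words(sample)

print(lines[1])            # print only the line of interest
lines[2].append("extra")   # change one stored line without touching the rest
print(lines[2])
```

With a real file, the same function could be called inside a
"with open(...) as handle:" block so the file is closed afterwards.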