On 25May2018 04:23, Subhabrata Banerjee <subhabangal...@gmail.com> wrote:
On Friday, May 25, 2018 at 3:59:57 AM UTC+5:30, Cameron Simpson wrote:
On 24May2018 03:13, Subhabrata Banerjee wrote:
>I have a text as,
>
>"Hawaii volcano generates toxic gas plume called laze PAHOA: The eruption of Kilauea volcano in Hawaii sparked 
new safety warnings about toxic gas on the Big Island's southern coastline after lava began flowing into the ocean and 
setting off a chemical reaction. Lava haze is made of dense white clouds of steam, toxic gas and tiny shards of volcanic 
glass. Janet Babb, a geologist with the Hawaiian Volcano Observatory, says the plume "looks innocuous, but it's 
not." "Just like if you drop a glass on your kitchen floor, there's some large pieces and there are some very, 
very tiny pieces," Babb said. "These little tiny pieces are the ones that can get wafted up in that steam 
plume." Scientists call the glass Limu O Pele, or Pele's seaweed, named after the Hawaiian goddess of volcano and 
fire"
>
>and I want to see its tagged output as,
>
>"Hawaii/TAG volcano generates toxic gas plume called laze PAHOA/TAG: The eruption of Kilauea/TAG volcano/TAG in 
Hawaii/TAG sparked new safety warnings about toxic gas on the Big Island's southern coastline after lava began flowing 
into the ocean and setting off a chemical reaction. Lava haze is made of dense white clouds of steam, toxic gas and tiny 
shards of volcanic glass. Janet/TAG Babb/TAG, a geologist with the Hawaiian/TAG Volcano/TAG Observatory/TAG, says the 
plume "looks innocuous, but it's not." "Just like if you drop a glass on your kitchen floor, there's some 
large pieces and there are some very, very tiny pieces," Babb/TAG said. "These little tiny pieces are the ones 
that can get wafted up in that steam plume." Scientists call the glass Limu/TAG O/TAG Pele/TAG, or Pele's seaweed, 
named after the Hawaiian goddess of volcano and fire"
>
>To do this I generally try to take a list at the back end as,
>
>Hawaii
>PAHOA
[...]
>and do a simple code as follows,
>
>def tag_text():
>    corpus=open("/python27/volcanotxt.txt","r").read().split()
>    wordlist=open("/python27/taglist.txt","r").read().split()
[...]
>    list1=[]
>    for word in corpus:
>        if word in wordlist:
>            word_new=word+"/TAG"
>            list1.append(word_new)
>        else:
>            list1.append(word)
>    tagged_text=" ".join(list1)
>    print tagged_text
>
>get the results and hand-repair unwanted tags like "Hawaiian/TAG goddess of 
volcano/TAG".
>I am looking for a better approach of coding so that I need not spend time on
>hand repairing.

It isn't entirely clear to me why these two taggings are unwanted. Intuitively,
they seem to be either because "Hawaiian goddess" is a compound term where you
don't want "Hawaiian" to get a tag, or because "Hawaiian" has already received
a tag earlier in the list. Or are there other criteria?

If you want to solve this problem with a programme you must first clearly
define what makes an unwanted tag "unwanted". [...]

By unwanted I did not mean anything so intricate.
Unwanted meant things I did not want.

That much was clear, but you need to specify in your own mind _precisely_ what makes some things unwanted and others wanted. Without concrete criteria you can't write code to implement those criteria.

I'm not saying "you need to imagine code to match these things": you're clearly capable of doing that. I'm saying you need to have well defined concepts of what makes something unwanted (or, if that is easier to define, wanted). You can do that iteratively: start with your basic concept and see how well it works. When those concepts don't give you the outcome you desire, consider a specific example which isn't working and try to figure out what additional criterion would let you distinguish it from a working example.

For example,
if my target phrases included terms like,
government of Mexico,

now in my list I would have words with their tags as,
government
of
Mexico

If I put these words in the list it would tag
government/TAG of/TAG Mexico/TAG

but would also tag all the "of" which may be
anywhere like haze is made of/TAG dense white,
clouds of/TAG steam, etc.

Cleaning up these unwanted taggings becomes a daunting task
for me.

Richard Damon has pointed out that you seem to want phrases instead of just words.

I have been experimenting around:

 wordlist = [("Kilauea volcano", "Kilauea/TAG volcano/TAG"), ("Hawaii", "Hawaii/TAG"), ...]
 tag = reduce(lambda a, kv: a.replace(*kv), wordlist, corpus)

which is giving me a sizeably good result, but the size of the wordlist is a slight concern.

You can reduce that list by generating the "wordlist" form from something smaller:

 base_phrases = ["Kilauea volcano", "government of Mexico", "Hawaii"]
 wordlist = [
     (base_phrase, " ".join([word + "/TAG" for word in base_phrase.split()]))
     for base_phrase in base_phrases
 ]
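For concreteness, here is that generated wordlist fed through the reduce-based replace from earlier in the thread (in Python 3, reduce lives in functools); the sample sentence is made up for illustration:

```python
from functools import reduce

base_phrases = ["Kilauea volcano", "government of Mexico", "Hawaii"]
# build (phrase, tagged form) pairs from the base phrases
wordlist = [
    (base_phrase, " ".join([word + "/TAG" for word in base_phrase.split()]))
    for base_phrase in base_phrases
]

corpus = "The eruption of Kilauea volcano alarmed the government of Mexico"
# apply each replacement in turn across the corpus
tagged = reduce(lambda text, pair: text.replace(*pair), wordlist, corpus)
print(tagged)
# The eruption of Kilauea/TAG volcano/TAG alarmed the government/TAG of/TAG Mexico/TAG
```

Note that the stray "of" in "eruption of" is left alone, because only the whole phrases appear in the replacement list.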

You could even autosplit the longer phrases so that your base_phrases _automatically_ becomes:

base_phrases = ["Kilauea volcano", "Kilauea", "volcano", "government of Mexico", "government", "Mexico", "Hawaii"]

That way your "replace" call would find the longer phrases before the shorter phrases and thus _not_ tag the single words if they occurred in a longer phrase, while still tagging the single words when they _didn't_ land in a longer phrase.
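One caveat with chaining plain str.replace calls: a later, shorter replacement can match inside text that was already tagged, or inside a longer word such as "Hawaiian". A sketch that sidesteps both problems by compiling the phrase list (longest first) into a single regex alternation with word boundaries; re.sub with a callable replacement is standard Python, but the function name and sample inputs here are mine:

```python
import re

def tag_phrases(corpus, base_phrases, tag="/TAG"):
    # autosplit: include the individual words of each multiword phrase
    phrases = set(base_phrases)
    for phrase in base_phrases:
        phrases.update(phrase.split())
    # longest alternatives first, so whole phrases beat their words
    ordered = sorted(phrases, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(p) for p in ordered) + r")\b"
    )
    # tag each word of the matched phrase, in a single pass
    return pattern.sub(
        lambda m: " ".join(w + tag for w in m.group(0).split()),
        corpus,
    )

print(tag_phrases("warnings about Kilauea volcano and the Hawaiian coast",
                  ["Kilauea volcano", "Hawaii"]))
# warnings about Kilauea/TAG volcano/TAG and the Hawaiian coast
```

Because the substitution happens in one pass, already-tagged text is never rescanned, and the \b boundaries stop "Hawaii" from matching inside "Hawaiian".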

Also, it is unclear to me whether "/TAG" is a fixed string or intended to be distinct such as "/PROPER_NOUN", "/LOCATION" etc. If they vary then you need a more elaborate setup.

It sounds like you want a more general purpose parser, and that depends upon your purposes. If you're coding to learn the basics of breaking up text, what you're doing is fine and I'd stick with it. But if you're just after the outcome (tags), you could use other libraries to break up the text.

For example, the Natural Language ToolKit (NLTK) will do structured parsing of text and return you a syntax tree, and it has many other facilities. Doco:

 http://www.nltk.org/

PyPI module:

 https://pypi.org/project/nltk/

which you can install with the command:

 pip install --user nltk

That would get you a tree structure of the corpus, which you could process more meaningfully. For example, you could traverse the tree and tag higher level nodes as you came across them, possibly then _not_ traversing their inner nodes. The effect of that would be that if you hit the grammatical node:

 government of Mexico

you might tag that node with "ORGANISATION" and choose not to descend inside it, thus avoiding tagging "government" and "of" and so forth because you already have a high level tag. Nodes not specially recognised you'd keep descending into, tagging smaller things.
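That traversal can be sketched without pulling in NLTK at all; here a node is just a (label, children) tuple standing in for a parse-tree node. This mimics the shape of the idea, not NLTK's actual Tree API:

```python
# A node is (label, children); children are nodes or plain word strings.

def leaves(node):
    """All the words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    _label, children = node
    words = []
    for child in children:
        words.extend(leaves(child))
    return words

def tag_tree(node, special_labels):
    """Tag whole recognised nodes; otherwise keep descending."""
    if isinstance(node, str):
        return [node]
    label, children = node
    if label in special_labels:
        # tag the node as a unit and do NOT traverse its insides
        return [" ".join(leaves(node)) + "/" + label]
    tagged = []
    for child in children:
        tagged.extend(tag_tree(child, special_labels))
    return tagged

tree = ("S", [
    ("NP", [("ORGANISATION", ["government", "of", "Mexico"])]),
    "issued", "new", "warnings",
])
print(" ".join(tag_tree(tree, {"ORGANISATION"})))
# government of Mexico/ORGANISATION issued new warnings
```

The "government" and "of" inside the recognised node never get individual tags, because the traversal stops at the ORGANISATION boundary.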

Cheers,
Cameron Simpson <c...@cskk.id.au>
--
https://mail.python.org/mailman/listinfo/python-list
