On 25May2018 04:23, Subhabrata Banerjee <subhabangal...@gmail.com> wrote:
On Friday, May 25, 2018 at 3:59:57 AM UTC+5:30, Cameron Simpson wrote:
On 24May2018 03:13, Subhabrata Banerjee wrote:
>I have a text as,
>
>"Hawaii volcano generates toxic gas plume called laze PAHOA: The eruption of Kilauea volcano in Hawaii sparked 
new safety warnings about toxic gas on the Big Island's southern coastline after lava began flowing into the ocean and 
setting off a chemical reaction. Lava haze is made of dense white clouds of steam, toxic gas and tiny shards of volcanic 
glass. Janet Babb, a geologist with the Hawaiian Volcano Observatory, says the plume "looks innocuous, but it's 
not." "Just like if you drop a glass on your kitchen floor, there's some large pieces and there are some very, 
very tiny pieces," Babb said. "These little tiny pieces are the ones that can get wafted up in that steam 
plume." Scientists call the glass Limu O Pele, or Pele's seaweed, named after the Hawaiian goddess of volcano and 
fire"
>
>and I want to see its tagged output as,
>
>"Hawaii/TAG volcano generates toxic gas plume called laze PAHOA/TAG: The eruption of Kilauea/TAG volcano/TAG in 
Hawaii/TAG sparked new safety warnings about toxic gas on the Big Island's southern coastline after lava began flowing 
into the ocean and setting off a chemical reaction. Lava haze is made of dense white clouds of steam, toxic gas and tiny 
shards of volcanic glass. Janet/TAG Babb/TAG, a geologist with the Hawaiian/TAG Volcano/TAG Observatory/TAG, says the 
plume "looks innocuous, but it's not." "Just like if you drop a glass on your kitchen floor, there's some 
large pieces and there are some very, very tiny pieces," Babb/TAG said. "These little tiny pieces are the ones 
that can get wafted up in that steam plume." Scientists call the glass Limu/TAG O/TAG Pele/TAG, or Pele's seaweed, 
named after the Hawaiian goddess of volcano and fire"
>
>To do this I generally try to take a list at the back end as,
>
>Hawaii
>PAHOA
[...]
>and do a simple code as follows,
>
>def tag_text():
>    corpus=open("/python27/volcanotxt.txt","r").read().split()
>    wordlist=open("/python27/taglist.txt","r").read().split()
[...]
>    list1=[]
>    for word in corpus:
>        if word in wordlist:
>            word_new=word+"/TAG"
>            list1.append(word_new)
>        else:
>            list1.append(word)
>    tagged_text=" ".join(list1)
>    print tagged_text
>
>get the results and hand-repair unwanted tags like "Hawaiian/TAG goddess of 
volcano/TAG".
>I am looking for a better approach of coding so that I need not spend time on
>hand repairing.

It isn't entirely clear to me why these two taggings are unwanted. Intuitively,
they seem to be either because "Hawaiian goddess" is a compound term where you
don't want "Hawaiian" to get a tag, or because "Hawaiian" has already received
a tag earlier in the list. Or are there other criteria?

If you want to solve this problem with a programme you must first clearly
define what makes an unwanted tag "unwanted". [...]

By unwanted I did not mean anything so intricate.
Unwanted meant things I did not want.

That much was clear, but you need to specify in your own mind _precisely_ what makes some things unwanted and others wanted. Without concrete criteria you can't write code to implement those criteria.

I'm not saying "you need to imagine code to match these things": you're clearly capable of doing that. I'm saying you need to have well defined concepts of what makes something unwanted (or, if that is easier to define, wanted). You can do that iteratively: start with your basic concept and see how well it works. When those concepts don't give you the outcome you desire, consider a specific example which isn't working and try to figure out what additional criterion would let you distinguish it from a working example.

For example,
if my target phrases included terms like,
government of Mexico,

now in my list I would have words with their tags as,
government
of
Mexico

If I put these words in the list it would tag
government/TAG of/TAG Mexico/TAG

but would also tag all the "of" which may be
anywhere like haze is made of/TAG dense white,
clouds of/TAG steam, etc.

Cleaning up these unwanted taggings becomes a daunting task
for me.

Richard Damon has pointed out that you seem to want phrases instead of just words.

I have been experimenting around:

 wordlist = [("Kilauea volcano", "Kilauea/TAG volcano/TAG"), ("Hawaii", "Hawaii/TAG"), ...]
 tag = reduce(lambda a, kv: a.replace(*kv), wordlist, corpus)

which is giving me a sizeably good result, but the size of the wordlist is a slight concern.

You can reduce that list by generating the "wordlist" form from something smaller:

 base_phrases = ["Kilauea volcano", "government of Mexico", "Hawaii"]
 wordlist = [
     (base_phrase, " ".join([word + "/TAG" for word in base_phrase.split()]))
     for base_phrase in base_phrases
 ]
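For concreteness, here is that generated wordlist fed through the reduce-based replace from earlier in the thread (in Python 3, reduce lives in functools); the sample sentence is made up for illustration:

```python
from functools import reduce

base_phrases = ["Kilauea volcano", "government of Mexico", "Hawaii"]
# build (phrase, tagged form) pairs from the base phrases
wordlist = [
    (base_phrase, " ".join([word + "/TAG" for word in base_phrase.split()]))
    for base_phrase in base_phrases
]

corpus = "The eruption of Kilauea volcano alarmed the government of Mexico"
# apply each replacement in turn across the corpus
tagged = reduce(lambda text, pair: text.replace(*pair), wordlist, corpus)
print(tagged)
# The eruption of Kilauea/TAG volcano/TAG alarmed the government/TAG of/TAG Mexico/TAG
```

Note that the stray "of" in "eruption of" is left alone, because only the whole phrases appear in the replacement list.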

You could even autosplit the longer phrases so that your base_phrases _automatically_ becomes:

base_phrases = ["Kilauea volcano", "Kilauea", "volcano", "government of Mexico", "government", "Mexico", "Hawaii"]

That way your "replace" call would find the longer phrases before the shorter phrases and thus _not_ tag the single words if they occurred in a longer phrase, while still tagging the single words when they _didn't_ land in a longer phrase.
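One caveat with chaining plain str.replace calls: a later, shorter replacement can match inside text that was already tagged, or inside a longer word such as "Hawaiian". A sketch that sidesteps both problems by compiling the phrase list (longest first) into a single regex alternation with word boundaries; re.sub with a callable replacement is standard Python, but the function name and sample inputs here are mine:

```python
import re

def tag_phrases(corpus, base_phrases, tag="/TAG"):
    # autosplit: include the individual words of each multiword phrase
    phrases = set(base_phrases)
    for phrase in base_phrases:
        phrases.update(phrase.split())
    # longest alternatives first, so whole phrases beat their words
    ordered = sorted(phrases, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(p) for p in ordered) + r")\b"
    )
    # tag each word of the matched phrase, in a single pass
    return pattern.sub(
        lambda m: " ".join(w + tag for w in m.group(0).split()),
        corpus,
    )

print(tag_phrases("warnings about Kilauea volcano and the Hawaiian coast",
                  ["Kilauea volcano", "Hawaii"]))
# warnings about Kilauea/TAG volcano/TAG and the Hawaiian coast
```

Because the substitution happens in one pass, already-tagged text is never rescanned, and the \b boundaries stop "Hawaii" from matching inside "Hawaiian".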

Also, it is unclear to me whether "/TAG" is a fixed string or intended to be distinct such as "/PROPER_NOUN", "/LOCATION" etc. If they vary then you need a more elaborate setup.

It sounds like you want a more general purpose parser, and that depends upon your purposes. If you're coding to learn the basics of breaking up text, what you're doing is fine and I'd stick with it. But if you're just after the outcome (tags), you could use other libraries to break up the text.

For example, the Natural Language ToolKit (NLTK) will do structured parsing of text and return you a syntax tree, and it has many other facilities. Doco:

 http://www.nltk.org/

PyPI module:

 https://pypi.org/project/nltk/

which you can install with the command:

 pip install --user nltk

That would get you a tree structure of the corpus, which you could process more meaningfully. For example, you could traverse the tree and tag higher level nodes as you came across them, possibly then _not_ traversing their inner nodes. The effect of that would be that if you hit the grammatical node:

 government of Mexico

you might tag that node with "ORGANISATION" and choose not to descend inside it, thus avoiding tagging "government" and "of" and so forth because you already have a high level tag. Nodes not specially recognised you'd keep descending into, tagging smaller things.
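That traversal can be sketched without pulling in NLTK at all; here a node is just a (label, children) tuple standing in for a parse-tree node. This mimics the shape of the idea, not NLTK's actual Tree API:

```python
# A node is (label, children); children are nodes or plain word strings.

def leaves(node):
    """All the words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    _label, children = node
    words = []
    for child in children:
        words.extend(leaves(child))
    return words

def tag_tree(node, special_labels):
    """Tag whole recognised nodes; otherwise keep descending."""
    if isinstance(node, str):
        return [node]
    label, children = node
    if label in special_labels:
        # tag the node as a unit and do NOT traverse its insides
        return [" ".join(leaves(node)) + "/" + label]
    tagged = []
    for child in children:
        tagged.extend(tag_tree(child, special_labels))
    return tagged

tree = ("S", [
    ("NP", [("ORGANISATION", ["government", "of", "Mexico"])]),
    "issued", "new", "warnings",
])
print(" ".join(tag_tree(tree, {"ORGANISATION"})))
# government of Mexico/ORGANISATION issued new warnings
```

The "government" and "of" inside the recognised node never get individual tags, because the traversal stops at the ORGANISATION boundary.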

Cheers,
Cameron Simpson <c...@cskk.id.au>
--
https://mail.python.org/mailman/listinfo/python-list
