On Saturday, May 26, 2018 at 3:54:37 AM UTC+5:30, Cameron Simpson wrote:
> On 25May2018 04:23, Subhabrata Banerjee wrote:
> >On Friday, May 25, 2018 at 3:59:57 AM UTC+5:30, Cameron Simpson wrote:
> >> On 24May2018 03:13, wrote:
> >> >I have a text as,
> >> >
> >> >"Hawaii volcano generates toxic gas plume called laze PAHOA: The eruption
> >> >of Kilauea volcano in Hawaii sparked new safety warnings about toxic gas
> >> >on the Big Island's southern coastline after lava began flowing into the
> >> >ocean and setting off a chemical reaction. Lava haze is made of dense
> >> >white clouds of steam, toxic gas and tiny shards of volcanic glass. Janet
> >> >Babb, a geologist with the Hawaiian Volcano Observatory, says the plume
> >> >"looks innocuous, but it's not." "Just like if you drop a glass on your
> >> >kitchen floor, there's some large pieces and there are some very, very
> >> >tiny pieces," Babb said. "These little tiny pieces are the ones that can
> >> >get wafted up in that steam plume." Scientists call the glass Limu O
> >> >Pele, or Pele's seaweed, named after the Hawaiian goddess of volcano and
> >> >fire"
> >> >
> >> >and I want to see its tagged output as,
> >> >
> >> >"Hawaii/TAG volcano generates toxic gas plume called laze PAHOA/TAG: The
> >> >eruption of Kilauea/TAG volcano/TAG in Hawaii/TAG sparked new safety
> >> >warnings about toxic gas on the Big Island's southern coastline after
> >> >lava began flowing into the ocean and setting off a chemical reaction.
> >> >Lava haze is made of dense white clouds of steam, toxic gas and tiny
> >> >shards of volcanic glass. Janet/TAG Babb/TAG, a geologist with the
> >> >Hawaiian/TAG Volcano/TAG Observatory/TAG, says the plume "looks
> >> >innocuous, but it's not." "Just like if you drop a glass on your kitchen
> >> >floor, there's some large pieces and there are some very, very tiny
> >> >pieces," Babb/TAG said. "These little tiny pieces are the ones that can
> >> >get wafted up in that steam plume."
> >> >Scientists call the glass Limu/TAG
> >> >O/TAG Pele/TAG, or Pele's seaweed, named after the Hawaiian goddess of
> >> >volcano and fire"
> >> >
> >> >To do this I generally try to take a list at the back end as,
> >> >
> >> >Hawaii
> >> >PAHOA
> [...]
> >> >and do a simple code as follows,
> >> >
> >> >def tag_text():
> >> >    corpus=open("/python27/volcanotxt.txt","r").read().split()
> >> >    wordlist=open("/python27/taglist.txt","r").read().split()
> [...]
> >> >    list1=[]
> >> >    for word in corpus:
> >> >        if word in wordlist:
> >> >            word_new=word+"/TAG"
> >> >            list1.append(word_new)
> >> >        else:
> >> >            list1.append(word)
> >> >    lst1=list1
> >> >    tagged_text=" ".join(lst1)
> >> >    print tagged_text
> >> >
> >> >get the results and hand repair unwanted tags Hawaiian/TAG goddess of
> >> >volcano/TAG.
> >> >I am looking for a better approach to coding so that I need not spend
> >> >time on hand repairing.
> >>
> >> It isn't entirely clear to me why these two taggings are unwanted.
> >> Intuitively, they seem to be either because "Hawaiian goddess" is a
> >> compound term where you don't want "Hawaiian" to get a tag, or because
> >> "Hawaiian" has already received a tag earlier in the list. Or are there
> >> other criteria?
> >>
> >> If you want to solve this problem with a programme you must first clearly
> >> define what makes an unwanted tag "unwanted". [...]
> >
> >By unwanted I did not mean anything so intricate.
> >Unwanted meant things I did not want.
>
> That much was clear, but you need to specify in your own mind _precisely_
> what makes some things unwanted and others wanted. Without concrete
> criteria you can't write code to implement those criteria.
>
> I'm not saying "you need to imagine code to match these things": you're
> clearly capable of doing that. I'm saying you need to have well defined
> concepts of what makes something unwanted (or, if that is easier to
> define, wanted).
> You can do that iteratively: start with your basic concept and see how
> well it works. When those concepts don't give you the outcome you desire,
> consider a specific example which isn't working and try to figure out
> what additional criterion would let you distinguish it from a working
> example.
>
> >For example,
> >if my target phrases included terms like,
> >government of Mexico,
> >
> >now in my list I would have words with their tags as,
> >government
> >of
> >Mexico
> >
> >If I put these words in list it would tag
> >government/TAG of/TAG Mexico
> >
> >but would also tag all the "of" which may be
> >anywhere like haze is made of/TAG dense white,
> >clouds of/TAG steam, etc.
> >
> >Cleaning these unwanted places becomes a daunting task
> >to me.
>
> Richard Damon has pointed out that you seem to want phrases instead of
> just words.
>
> >I have been experimenting around
> >wordlist=[("Kilauea volcano","Kilauea/TAG volcano/TAG"),("Hawaii","Hawaii/TAG"),...]
> >tag=reduce(lambda a, kv: a.replace(*kv), wordlist, corpus)
> >
> >is giving me sizeably good results but the size of the wordlist is a
> >slight concern.
>
> You can reduce that list by generating the "wordlist" form from something
> smaller:
>
>     base_phrases = ["Kilauea volcano", "government of Mexico", "Hawaii"]
>     wordlist = [
>         (base_phrase, " ".join([word + "/TAG" for word in base_phrase.split()]))
>         for base_phrase in base_phrases
>     ]
>
> You could even autosplit the longer phrases so that your base_phrases
> _automatically_ becomes:
>
>     base_phrases = ["Kilauea volcano", "Kilauea", "volcano",
>                     "government of Mexico", "government", "Mexico",
>                     "Hawaii"]
>
> That way your "replace" call would find the longer phrases before the
> shorter phrases and thus _not_ tag the single words if they occurred in a
> longer phrase, while still tagging the single words when they _didn't_
> land in a longer phrase.
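[Editor's note: the longest-phrase-first idea above can be sketched with a single regex pass rather than chained str.replace calls. A single pass also sidesteps a subtlety of repeated replaces: once "Kilauea volcano" has become "Kilauea/TAG volcano/TAG", a later replace of the bare word "volcano" would match inside the already-tagged text and produce "volcano/TAG/TAG". This is only an illustrative sketch; the phrase list is the autosplit example from the quoted message, which deliberately leaves out the stand-alone "of".]

```python
import re

# Longer phrases first, so "Kilauea volcano" wins over bare "Kilauea".
base_phrases = sorted(
    ["Kilauea volcano", "Kilauea", "volcano",
     "government of Mexico", "government", "Mexico", "Hawaii"],
    key=len, reverse=True,
)

# One alternation, longest alternative listed first; \b keeps "Hawaii"
# from matching inside "Hawaiian".
pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, base_phrases)) + r")\b")

def tag_match(m):
    # Tag each word of the matched phrase individually.
    return " ".join(word + "/TAG" for word in m.group(0).split())

corpus = "The Kilauea volcano erupted and Hawaii issued new warnings"
print(pattern.sub(tag_match, corpus))
# -> The Kilauea/TAG volcano/TAG erupted and Hawaii/TAG issued new warnings
```

Because "of" is not in the phrase list, a sentence like "haze is made of dense clouds" passes through untouched, which is exactly the behaviour the original poster was after.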
>
> Also, it is unclear to me whether "/TAG" is a fixed string or intended to
> be distinct, such as "/PROPER_NOUN", "/LOCATION" etc. If they vary then
> you need a more elaborate setup.
>
> It sounds like you want a more general purpose parser, and that depends
> upon your purposes. If you're coding to learn the basics of breaking up
> text, what you're doing is fine and I'd stick with it. But if you're just
> after the outcome (tags), you could use other libraries to break up the
> text.
>
> For example, the Natural Language ToolKit (NLTK) will do structured
> parsing of text and return you a syntax tree, and it has many other
> facilities. Doco:
>
>     http://www.nltk.org/
>
> PyPI module:
>
>     https://pypi.org/project/nltk/
>
> which you can install with the command:
>
>     pip install --user nltk
>
> That would get you a tree structure of the corpus, which you could
> process more meaningfully. For example, you could traverse the tree and
> tag higher level nodes as you came across them, possibly then _not_
> traversing their inner nodes. The effect of that would be that if you hit
> the grammatic node:
>
>     government of Mexico
>
> you might tag that node with "ORGANISATION", and choose not to descend
> inside it, thus avoiding tagging "government" and "of" and so forth
> because you have a high level tag. Nodes not specially recognised you'd
> keep descending into, tagging smaller things.
>
> Cheers,
> Cameron Simpson
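[Editor's note: the "tag a high-level node and don't descend into it" traversal can be illustrated without NLTK itself, using nested tuples as a stand-in for the parse tree. The tree shape and the ORG/LOC labels here are assumptions for illustration, not NLTK's actual output format.]

```python
# A toy parse tree: a node is either a plain string (a leaf word)
# or a (label, children) pair. Labels in TAGGED_LABELS get a tag.
TAGGED_LABELS = {"ORG", "LOC"}

def words(node):
    """Flatten a subtree to its leaf words, ignoring labels."""
    if isinstance(node, str):
        return [node]
    _, children = node
    return [w for child in children for w in words(child)]

def render(node):
    """Render a tree back to text, tagging whole recognised nodes
    and NOT descending into them, so inner nodes stay untagged."""
    if isinstance(node, str):
        return node
    label, children = node
    if label in TAGGED_LABELS:
        # Tag the phrase as one unit; inner labels (e.g. a LOC inside
        # an ORG) are deliberately skipped.
        return " ".join(words(node)) + "/" + label
    return " ".join(render(child) for child in children)

# Toy tree for "the government of Mexico issued warnings" (shape assumed).
tree = ("S", [
    "the",
    ("ORG", ["government", "of", ("LOC", ["Mexico"])]),
    "issued", "warnings",
])
print(render(tree))
# -> the government of Mexico/ORG issued warnings
```

Note that "Mexico" carries no /LOC tag in the output even though its node is labelled, because the traversal stopped at the enclosing ORG node, which is the behaviour described in the message above.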
Dear Sir,

Thank you for your kind and valuable suggestions, and for your time. I know NLTK and machine learning. I am of the belief that if I use language properly, we need machine learning the least. So I am trying to design a tagger without the help of machine learning, with simple Python coding. I have thus set aside the standard Parts of Speech (PoS) and Named Entity (NE) tagging schemes. I am trying to design a basic model which, if required, may be applied to either of these problems. Detecting longer phrases is slightly a problem now; I am thinking of employing re.search(pattern, text). If this part is done I do not need machine learning. Maintaining so much data is a cumbersome issue in machine learning.

My regards to all the other esteemed coders and members of the group for their kind and valuable time and suggestions.
--
https://mail.python.org/mailman/listinfo/python-list
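[Editor's note: the re-based phrase detection mentioned in this reply might be sketched with re.sub and a replacement function rather than re.search, so every occurrence of a listed phrase is found and tagged in one pass. The phrase-to-tag mapping below is purely illustrative; the thread does not specify distinct tag names.]

```python
import re

# Illustrative phrase -> tag mapping; the tag names are assumptions.
phrase_tags = {
    "government of Mexico": "ORG",
    "Hawaiian Volcano Observatory": "ORG",
    "Hawaii": "LOC",
}

# Longest phrases first so multi-word phrases win over their parts.
pattern = re.compile(
    r"\b(?:" + "|".join(
        re.escape(p) for p in sorted(phrase_tags, key=len, reverse=True)
    ) + r")\b"
)

def tag_phrases(text):
    """Append each known phrase's tag in a single regex pass."""
    return pattern.sub(lambda m: m.group(0) + "/" + phrase_tags[m.group(0)], text)

print(tag_phrases("The government of Mexico praised Hawaii."))
# -> The government of Mexico/ORG praised Hawaii/LOC.
```

Since the whole phrase matches as one unit, "government" and "of" never receive tags of their own, which avoids the hand-repair step described earlier in the thread.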