On Saturday, May 26, 2018 at 3:54:37 AM UTC+5:30, Cameron Simpson wrote:
> On 25May2018 04:23, Subhabrata Banerjee wrote:
> >On Friday, May 25, 2018 at 3:59:57 AM UTC+5:30, Cameron Simpson wrote:
> >> On 24May2018 03:13, wrote:
> >> >I have a text as,
> >> >
> >> >"Hawaii volcano generates toxic gas plume called laze PAHOA: The eruption
> >> >of Kilauea volcano in Hawaii sparked new safety warnings about toxic gas
> >> >on the Big Island's southern coastline after lava began flowing into the
> >> >ocean and setting off a chemical reaction. Lava haze is made of dense
> >> >white clouds of steam, toxic gas and tiny shards of volcanic glass. Janet
> >> >Babb, a geologist with the Hawaiian Volcano Observatory, says the plume
> >> >"looks innocuous, but it's not." "Just like if you drop a glass on your
> >> >kitchen floor, there's some large pieces and there are some very, very
> >> >tiny pieces," Babb said. "These little tiny pieces are the ones that can
> >> >get wafted up in that steam plume." Scientists call the glass Limu O
> >> >Pele, or Pele's seaweed, named after the Hawaiian goddess of volcano and
> >> >fire"
> >> >
> >> >and I want to see its tagged output as,
> >> >
> >> >"Hawaii/TAG volcano generates toxic gas plume called laze PAHOA/TAG: The
> >> >eruption of Kilauea/TAG volcano/TAG in Hawaii/TAG sparked new safety
> >> >warnings about toxic gas on the Big Island's southern coastline after
> >> >lava began flowing into the ocean and setting off a chemical reaction.
> >> >Lava haze is made of dense white clouds of steam, toxic gas and tiny
> >> >shards of volcanic glass. Janet/TAG Babb/TAG, a geologist with the
> >> >Hawaiian/TAG Volcano/TAG Observatory/TAG, says the plume "looks
> >> >innocuous, but it's not." "Just like if you drop a glass on your kitchen
> >> >floor, there's some large pieces and there are some very, very tiny
> >> >pieces," Babb/TAG said. "These little tiny pieces are the ones that can
> >> >get wafted up in that steam plume."
> >> >Scientists call the glass Limu/TAG
> >> >O/TAG Pele/TAG, or Pele's seaweed, named after the Hawaiian goddess of
> >> >volcano and fire"
> >> >
> >> >To do this I generally try to take a list at the back end as,
> >> >
> >> >Hawaii
> >> >PAHOA
> [...]
> >> >and do a simple code as follows,
> >> >
> >> >def tag_text():
> >> >    corpus=open("/python27/volcanotxt.txt","r").read().split()
> >> >    wordlist=open("/python27/taglist.txt","r").read().split()
> [...]
> >> >    list1=[]
> >> >    for word in corpus:
> >> >        if word in wordlist:
> >> >            word_new=word+"/TAG"
> >> >            list1.append(word_new)
> >> >        else:
> >> >            list1.append(word)
> >> >    lst1=list1
> >> >    tagged_text=" ".join(lst1)
> >> >    print tagged_text
> >> >
> >> >get the results and hand repair unwanted tags Hawaiian/TAG goddess of
> >> >volcano/TAG.
> >> >I am looking for a better approach to coding so that I need not spend
> >> >time on hand repairing.
> >>
> >> It isn't entirely clear to me why these two taggings are unwanted.
> >> Intuitively, they seem to be either because "Hawaiian goddess" is a
> >> compound term where you don't want "Hawaiian" to get a tag, or because
> >> "Hawaiian" has already received a tag earlier in the list. Or are there
> >> other criteria?
> >>
> >> If you want to solve this problem with a programme you must first clearly
> >> define what makes an unwanted tag "unwanted". [...]
> >
> >By unwanted I did not mean anything so intricate.
> >Unwanted meant things I did not want.
>
> That much was clear, but you need to specify in your own mind _precisely_
> what makes some things unwanted and others wanted. Without concrete
> criteria you can't write code to implement those criteria.
>
> I'm not saying "you need to imagine code to match these things": you're
> clearly capable of doing that. I'm saying you need to have well defined
> concepts of what makes something unwanted (or, if that is easier to
> define, wanted).
> You can do that iteratively: start with your basic concept and see how
> well it works. When those concepts don't give you the outcome you desire,
> consider a specific example which isn't working and try to figure out
> what additional criterion would let you distinguish it from a working
> example.
>
> >For example,
> >if my target phrases included terms like,
> >government of Mexico,
> >
> >now in my list I would have words with their tags as,
> >government
> >of
> >Mexico
> >
> >If I put these words in list it would tag
> >government/TAG of/TAG Mexico
> >
> >but would also tag all the "of" which may be
> >anywhere like haze is made of/TAG dense white,
> >clouds of/TAG steam, etc.
> >
> >Cleaning these unwanted places becomes a daunting task
> >to me.
>
> Richard Damon has pointed out that you seem to want phrases instead of
> just words.
>
> >I have been experimenting around
> >wordlist=[("Kilauea volcano","Kilauea/TAG volcano/TAG"),("Hawaii","Hawaii/TAG"),...]
> >tag=reduce(lambda a, kv: a.replace(*kv), wordlist, corpus)
> >
> >is giving me sizeably good results but the size of the wordlist is a
> >slight concern.
>
> You can reduce that list by generating the "wordlist" form from something
> smaller:
>
>     base_phrases = ["Kilauea volcano", "government of Mexico", "Hawaii"]
>     wordlist = [
>         (base_phrase, " ".join([word + "/TAG" for word in base_phrase.split()]))
>         for base_phrase in base_phrases
>     ]
>
> You could even autosplit the longer phrases so that your base_phrases
> _automatically_ becomes:
>
>     base_phrases = ["Kilauea volcano", "Kilauea", "volcano",
>                     "government of Mexico", "government", "Mexico",
>                     "Hawaii"]
>
> That way your "replace" call would find the longer phrases before the
> shorter phrases and thus _not_ tag the single words if they occurred in a
> longer phrase, while still tagging the single words when they _didn't_
> land in a longer phrase.
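[Editor's note: the longest-phrase-first idea above can be sketched with a single regex pass rather than chained str.replace calls. A single pass also sidesteps a subtlety of repeated replaces: once "Kilauea volcano" has become "Kilauea/TAG volcano/TAG", a later replace of the bare word "volcano" would match inside the already-tagged text and produce "volcano/TAG/TAG". This is only an illustrative sketch; the phrase list is the autosplit example from the quoted message, which deliberately leaves out the stand-alone "of".]

```python
import re

# Longer phrases first, so "Kilauea volcano" wins over bare "Kilauea".
base_phrases = sorted(
    ["Kilauea volcano", "Kilauea", "volcano",
     "government of Mexico", "government", "Mexico", "Hawaii"],
    key=len, reverse=True,
)

# One alternation, longest alternative listed first; \b keeps "Hawaii"
# from matching inside "Hawaiian".
pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, base_phrases)) + r")\b")

def tag_match(m):
    # Tag each word of the matched phrase individually.
    return " ".join(word + "/TAG" for word in m.group(0).split())

corpus = "The Kilauea volcano erupted and Hawaii issued new warnings"
print(pattern.sub(tag_match, corpus))
# -> The Kilauea/TAG volcano/TAG erupted and Hawaii/TAG issued new warnings
```

Because "of" is not in the phrase list, a sentence like "haze is made of dense clouds" passes through untouched, which is exactly the behaviour the original poster was after.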
>
> Also, it is unclear to me whether "/TAG" is a fixed string or intended to
> be distinct, such as "/PROPER_NOUN", "/LOCATION" etc. If they vary then
> you need a more elaborate setup.
>
> It sounds like you want a more general purpose parser, and that depends
> upon your purposes. If you're coding to learn the basics of breaking up
> text, what you're doing is fine and I'd stick with it. But if you're just
> after the outcome (tags), you could use other libraries to break up the
> text.
>
> For example, the Natural Language ToolKit (NLTK) will do structured
> parsing of text and return you a syntax tree, and it has many other
> facilities. Doco:
>
>     http://www.nltk.org/
>
> PyPI module:
>
>     https://pypi.org/project/nltk/
>
> which you can install with the command:
>
>     pip install --user nltk
>
> That would get you a tree structure of the corpus, which you could
> process more meaningfully. For example, you could traverse the tree and
> tag higher level nodes as you came across them, possibly then _not_
> traversing their inner nodes. The effect of that would be that if you hit
> the grammatic node:
>
>     government of Mexico
>
> you might tag that node with "ORGANISATION", and choose not to descend
> inside it, thus avoiding tagging "government" and "of" and so forth
> because you have a high level tag. Nodes not specially recognised you'd
> keep descending into, tagging smaller things.
>
> Cheers,
> Cameron Simpson
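[Editor's note: the "tag a high-level node and don't descend into it" traversal can be illustrated without NLTK itself, using nested tuples as a stand-in for the parse tree. The tree shape and the ORG/LOC labels here are assumptions for illustration, not NLTK's actual output format.]

```python
# A toy parse tree: a node is either a plain string (a leaf word)
# or a (label, children) pair. Labels in TAGGED_LABELS get a tag.
TAGGED_LABELS = {"ORG", "LOC"}

def words(node):
    """Flatten a subtree to its leaf words, ignoring labels."""
    if isinstance(node, str):
        return [node]
    _, children = node
    return [w for child in children for w in words(child)]

def render(node):
    """Render a tree back to text, tagging whole recognised nodes
    and NOT descending into them, so inner nodes stay untagged."""
    if isinstance(node, str):
        return node
    label, children = node
    if label in TAGGED_LABELS:
        # Tag the phrase as one unit; inner labels (e.g. a LOC inside
        # an ORG) are deliberately skipped.
        return " ".join(words(node)) + "/" + label
    return " ".join(render(child) for child in children)

# Toy tree for "the government of Mexico issued warnings" (shape assumed).
tree = ("S", [
    "the",
    ("ORG", ["government", "of", ("LOC", ["Mexico"])]),
    "issued", "warnings",
])
print(render(tree))
# -> the government of Mexico/ORG issued warnings
```

Note that "Mexico" carries no /LOC tag in the output even though its node is labelled, because the traversal stopped at the enclosing ORG node, which is the behaviour described in the message above.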
Dear Sir,

Thank you for your kind and valuable suggestions, and for your time. I know NLTK and machine learning. I am of the belief that if I use language properly, we need machine learning the least. So I am trying to design a tagger without the help of machine learning, with simple Python coding. I have thus set aside the standard Parts of Speech (PoS) and Named Entity (NE) tagging schemes. I am trying to design a basic model which, if required, may be applied to either of these problems. Detecting longer phrases is slightly a problem now; I am thinking of employing re.search(pattern, text). If this part is done I do not need machine learning. Maintaining so much data is a cumbersome issue in machine learning.

My regards to all the other esteemed coders and members of the group for their kind and valuable time and suggestions.
--
https://mail.python.org/mailman/listinfo/python-list
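[Editor's note: the re-based phrase detection mentioned in this reply might be sketched with re.sub and a replacement function rather than re.search, so every occurrence of a listed phrase is found and tagged in one pass. The phrase-to-tag mapping below is purely illustrative; the thread does not specify distinct tag names.]

```python
import re

# Illustrative phrase -> tag mapping; the tag names are assumptions.
phrase_tags = {
    "government of Mexico": "ORG",
    "Hawaiian Volcano Observatory": "ORG",
    "Hawaii": "LOC",
}

# Longest phrases first so multi-word phrases win over their parts.
pattern = re.compile(
    r"\b(?:" + "|".join(
        re.escape(p) for p in sorted(phrase_tags, key=len, reverse=True)
    ) + r")\b"
)

def tag_phrases(text):
    """Append each known phrase's tag in a single regex pass."""
    return pattern.sub(lambda m: m.group(0) + "/" + phrase_tags[m.group(0)], text)

print(tag_phrases("The government of Mexico praised Hawaii."))
# -> The government of Mexico/ORG praised Hawaii/LOC.
```

Since the whole phrase matches as one unit, "government" and "of" never receive tags of their own, which avoids the hand-repair step described earlier in the thread.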