Guys, thanks for the replies. That is exactly the question: how to segment better before using the sentence parser. If I cannot segment it better, then another choice is to use the parser for segmentation, as I have stated.
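One way to segment better before the sentence detector is to break the text wherever a date is fused onto its neighbouring words, which is the dominant noise in this PDF extract. The following is only a sketch in plain Python (no Tika or OpenNLP calls); the function name `split_fused_dates` and both regular expressions are assumptions made for illustration, and they will miss or mis-split other kinds of noise.

```python
import re

# Hypothetical helper, not part of Tika or OpenNLP.
MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")

# A month name glued to the previous word and followed by "D, YYYY",
# e.g. "WorkplaceFebruary 3, 2014".
FUSED_MONTH = re.compile(
    r"(?<=[A-Za-z0-9])(?=(?:%s)\s+\d{1,2},\s*\d{4})" % MONTHS,
    re.IGNORECASE,
)

# A four-digit year glued to the next word, e.g. "2014Microsoft".
FUSED_YEAR = re.compile(r"\b((?:19|20)\d{2})(?=[A-Za-z])")


def split_fused_dates(text):
    """Insert breaks around dates fused into neighbouring words."""
    text = FUSED_MONTH.sub("\n", text)
    text = FUSED_YEAR.sub(r"\1\n", text)
    return [part.strip() for part in text.split("\n") if part.strip()]


if __name__ == "__main__":
    sample = ("Engagement WorkplaceFebruary 3, 2014Microsoft Aims "
              "sharePoint To The CloudJanuary 27, 2014setting The Technology")
    for segment in split_fused_dates(sample):
        print(segment)
```

The resulting segments could then be passed to the sentence detector one at a time, so that a title, a date and the next title no longer end up in the same "sentence".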
Tika is crappy and, no, I do not know the PDFs' structures beforehand. Also, Tika does not handle PDF by itself; as I checked, it uses another Java library for it. So I guess I still need to think about how to fix the text beforehand as well.

On Tue, Jul 15, 2014 at 5:49 AM, William Colen wrote:

> A while back I had a similar problem while extracting text from HTML using Tika. What I did was to hack the Tika HTML parser to extract the text as I needed. I can't remember exactly how it was, but as far as I remember Tika raises events when it finds markup (at least HTML markup) that is not handled by default. If you know the structure of the document you are reading, you can decide what to do with the markup and maybe change the output (adding a space, a line break, etc.).
>
> 2014-07-15 5:00 GMT-03:00 Jörn Kottmann:
>
> > Text extracted from PDFs must often be cleaned up first, e.g. fix tokenization, remove page header/footer, fix hyphenation, detect headlines/titles, etc.
> >
> > If there are fundamental issues with the plain text, the OpenNLP components trained on cleaned text will not work very well.
> >
> > Jörn
> >
> > On 07/15/2014 05:38 AM, Carlos Scheidecker wrote:
> >
> >> Hello all,
> >>
> >> I have an interesting problem here. More of a challenge.
> >>
> >> I have been doing text cleansing for bad characters and all.
> >>
> >> Then I have another interesting problem.
> >>
> >> Extracting a public PDF with Tika does not necessarily mean you will get clean text, because the original PDF might have different fonts within a section that will cause weird behaviors.
> >>
> >> If you then divide it into sentences via OpenNLP, you will get some interesting sentences.
> >>
> >> Trying to parse those sentences, it then gets worse.
> >>
> >> I am showing an example below and I would like to ask about solutions to it, considering the text can be noisy.
> >>
> >> I do not think that it will be easy to fix the Sentence Parser. Here is what I think on approaching it:
> >>
> >> Instead, the best way is to look at the poorly parsed sentences, parse them, and extract the inner (S) from the parse as separate sentences.
> >>
> >> What would you suggest?
> >>
> >> Here is an example of a piece of text extracted with Tika from a public PDF. This part is what OpenNLP considered to be a sentence:
> >>
> >> ----
> >>
> >> related research DocumentsBrief: Your Next Portal should Be An Engagement WorkplaceFebruary 3, 2014Microsoft Aims sharePoint To The CloudJanuary 27, 2014setting The Technology Foundation For Your social Business And Collaboration strategyJuly 29, 2013The Forrester wave : enterprise social Platforms, Q2 2014The 13 Providers That Matter Most And How They stack Upby rob Koplowitzwith Peter Burris and Nancy Wang2257913JUNE 5, 2014For CIos The Forrester Wave : Enterprise social Platforms, Q2 2014 2 2014, Forrester Research, Inc. Reproduction Prohibited June 5, 2014 eNTeRPRIse sOCIaL PLaTFORM MaRKeT MaTuRes aMID CONsOLIDaTION aND INTegRaTIONThe enterprise social platform is no longer in its infancy as offerings become increasingly functional.
> >>
> >> ----
> >>
> >> It is now parsed as follows:
> >>
> >> (S (S (S (NP (VBN related) (NN research) (NNP DocumentsBrief:) (NNP Your) (NNP Next) (NNP Portal)) (VP (MD should) (VP (VB Be) (NP (NP (DT An) (NNP Engagement) (NNP WorkplaceFebruary) (CD 3,) (JJ 2014Microsoft) (NNP Aims) (NN sharePoint)) (PP (TO To) (NP (DT The) (NNP CloudJanuary))))))) (VP (VBD 27,) (S (VP (VBG 2014setting) (NP (NP (NP (DT The) (NNP Technology) (NNP Foundation)) (PP (IN For) (NP (PRP$ Your) (JJ social) (NNP Business)))) (CC And) (NP (NNP Collaboration))) (PP (RB strategyJuly) (NP (CD 29,) (CD 2013The) (NNP Forrester) (NN wave))))))) (: :) (S (VP (VB enterprise) (NP (JJ social) (NN Platforms,)) (PP (IN Q2) (NP (NP (DT 2014The) (CD 13) (NNS Providers)) (NP (NP (DT That) (NNP Matter) (JJS Most)) (SBAR (S (CC And) (SBAR (WHADVP (WRB How)) (S (NP (PRP They)) (VP (VBP stack) (PP (IN Upby) (NP (NP (NN rob)) (PP (IN Koplowitzwith) (NP (NP (NNP Peter) (NNP Burris)) (CC and) (NP (NP (NNP Nancy) (NNP Wang2257913JUNE) (CD 5,)) (PP (IN 2014For) (NP (NP (NNP CIos)) (NP (DT The) (NNP Forrester) (NNP Wave)) (: :) (S (NP (NP (NP (NP (NN Enterprise) (JJ social) (NN Platforms,)) (PP (IN Q2) (NP (CD 2014) (CD 2) (JJ 2014,) (NNP Forrester) (NNP Research,) (NNP Inc.) (NNP Reproduction) (NNP Prohibited) (NNP June) (CD 5,) (CD 2014) (NN eNTeRPRIse) (NN sOCIaL)))) (NP (NNP PLaTFORM) (NNP MaRKeT) (NNP MaTuRes) (NN aMID) (NN CONsOLIDaTION))) (PP (IN aND) (NP (DT INTegRaTIONThe) (NN enterprise) (JJ social) (NN platform)))) (VP (VP (VBZ is) (ADVP (RB no) (RB longer)) (PP (IN in) (NP (PRP$ its) (NN infancy))) (SBAR (IN as) (S (NP (NNS offerings)) (VP (VBP become) (ADVP (RB increasingly)))))) (VBG functional.)))))))))))))))))))))
> >>
> >> Notice that I have more than one (S (S (S
> >>
> >> And then I have the first correct structure as (S (NP ..... (VP.....
> >>
> >> What is the best way to deal with it?
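
The idea of pulling the inner (S) nodes out of the parse can be sketched on the bracketed parse string itself. This is a minimal illustration in plain Python, not OpenNLP's own API: `parse_brackets`, `leaves` and `clause_texts` are hypothetical helpers, and the rule "an S with a direct NP child and a direct VP child counts as a sentence" is an assumption taken from the (S (NP ... (VP ... pattern mentioned above.

```python
import re


def parse_brackets(text):
    """Parse a Penn-style bracketed string into (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read_node():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                children.append(tokens[pos])   # a word (leaf)
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return read_node()


def leaves(node):
    """Return the words under a node, in order."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1]:
        words.extend(leaves(child))
    return words


def clause_texts(node):
    """Collect the text of each outermost S that has direct NP and VP children."""
    if isinstance(node, str):
        return []
    label, children = node
    child_labels = {c[0] for c in children if not isinstance(c, str)}
    if label == "S" and {"NP", "VP"} <= child_labels:
        return [" ".join(leaves(node))]
    found = []
    for child in children:
        found.extend(clause_texts(child))
    return found


if __name__ == "__main__":
    parse = ("(S (S (S (NP (DT The) (NN portal)) (VP (VBZ works))) "
             "(VP (VBD 27,))) (: :) (S (NP (PRP They)) (VP (VBP stack))))")
    print(clause_texts(parse_brackets(parse)))
```

Words that fall outside any NP+VP clause (the stray "27," in the toy parse) are dropped by this rule, so on noisy input it would be worth logging them rather than discarding them silently.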
