Guys.

Thanks for the replies. That is exactly the question: how to segment better
before using the sentence parser. If I cannot segment it better, then the
other option is to use the parser itself for segmentation, as I described.
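
A rough sketch of that fallback, in plain Java with no OpenNLP dependency:
read the bracketed parse string, then keep the words of every (S) node that
has no further (S) inside it. Treating only those innermost clauses as
sentences is one assumption about what "inner (S)" should mean, and the names
InnerClauseExtractor and innerClauses are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class InnerClauseExtractor {
    // One node of a bracketed parse such as (S (NP (PRP They)) (VP (VBP stack))).
    static final class Node {
        final String label;
        final List<Node> children = new ArrayList<>();
        String word; // set only for leaves such as (PRP They)
        Node(String label) { this.label = label; }
    }

    private final String s;
    private int pos;

    private InnerClauseExtractor(String s) { this.s = s; }

    // Parses one well-formed bracketed tree; malformed input is not handled.
    static Node parse(String bracketed) {
        return new InnerClauseExtractor(bracketed.trim()).readNode();
    }

    private Node readNode() {
        pos++; // skip '('
        int start = pos;
        while (!Character.isWhitespace(s.charAt(pos))) pos++;
        Node node = new Node(s.substring(start, pos));
        while (s.charAt(pos) != ')') {
            char c = s.charAt(pos);
            if (Character.isWhitespace(c)) {
                pos++;
            } else if (c == '(') {
                node.children.add(readNode());
            } else {
                int wordStart = pos;
                while (s.charAt(pos) != ')') pos++;
                node.word = s.substring(wordStart, pos).trim();
            }
        }
        pos++; // skip ')'
        return node;
    }

    // Returns the words of every S node that contains no nested S, left to right.
    static List<String> innerClauses(String bracketed) {
        List<String> out = new ArrayList<>();
        collect(parse(bracketed), out);
        return out;
    }

    // Returns true if this node is an S or contains one.
    private static boolean collect(Node n, List<String> out) {
        boolean nestedS = false;
        for (Node child : n.children) {
            nestedS |= collect(child, out);
        }
        boolean isS = n.label.equals("S");
        if (isS && !nestedS) {
            StringBuilder sb = new StringBuilder();
            appendWords(n, sb);
            out.add(sb.toString());
        }
        return isS || nestedS;
    }

    private static void appendWords(Node n, StringBuilder sb) {
        if (n.word != null) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(n.word);
        }
        for (Node child : n.children) appendWords(child, sb);
    }
}
```

The result is only as good as the bracketing: words that sit outside every
innermost (S) are dropped, and a wrong parse gives wrong segments.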

Tika is crappy and, no, I do not know the PDFs' structure in advance.
Also, Tika does not handle PDF by itself; as I checked, it uses another
Java library for that.

So I guess I still need to think about how to fix the text beforehand as well.
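
One possible starting point for that pre-fix, sketched in plain Java: in the
sample quoted below, titles are glued to dates ("WorkplaceFebruary 3,
2014Microsoft"), so a month name fused to the previous word gets a space and
a four-digit year fused to the next word gets a line break. Both rules are
assumptions fitted to this one sample, and the class name
ExtractedTextCleaner is hypothetical.

```java
import java.util.regex.Pattern;

class ExtractedTextCleaner {
    // Zero-width position between a lowercase letter and "Month <digit>",
    // e.g. inside "WorkplaceFebruary 3".
    private static final Pattern FUSED_MONTH = Pattern.compile(
        "(?<=\\p{Ll})(?=(?:January|February|March|April|May|June|July|August"
        + "|September|October|November|December)\\s+\\d)");

    // A standalone four-digit number directly followed by a letter,
    // e.g. "2014Microsoft".
    private static final Pattern FUSED_YEAR =
        Pattern.compile("(?<!\\d)(\\d{4})(?=\\p{L})");

    static String clean(String text) {
        // Separate the month from the word it is glued to.
        String spaced = FUSED_MONTH.matcher(text).replaceAll(" ");
        // Assume a date closes a title, so start a new line after the year.
        return FUSED_YEAR.matcher(spaced).replaceAll("$1\n");
    }
}
```

This leaves things like "sharePoint" and the mixed-case heading alone, so it
handles only the date boundaries, not the font-related noise.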



On Tue, Jul 15, 2014 at 5:49 AM, William Colen <[email protected]>
wrote:

> A while back I had a similar problem while extracting text from HTML using
> Tika. What I did was hack the Tika HTML parser to extract the text the way I
> needed. I can't remember exactly how it worked, but as far as I remember Tika
> raises events when it finds markup (at least HTML markup) that is not
> handled by default. If you know the structure of the document you are
> reading, you can decide what to do with the markup and maybe change the
> output (adding a space, a line break, etc.).
>
>
>
> 2014-07-15 5:00 GMT-03:00 Jörn Kottmann <[email protected]>:
>
> > Text extracted from PDFs must often be cleaned up first, e.g.
> > fix tokenization, remove page header/footer, fix hyphenation, detect
> > headlines/titles, etc.
> >
> > If there are fundamental issues with the plain text, the OpenNLP components
> > trained on cleaned text will not work very well.
> >
> > Jörn
> >
> >
> > On 07/15/2014 05:38 AM, Carlos Scheidecker wrote:
> >
> >> Hello all,
> >>
> >> I have an interesting problem here. More of a challenge.
> >>
> >> I have been doing text cleansing for bad characters and all.
> >>
> >> Then I have another interesting problem.
> >>
> >> Extracting a public PDF with Tika does not necessarily mean you will get
> >> clean text, because the original PDF might have different fonts within a
> >> section that will cause weird behavior.
> >>
> >> If you then divide it into sentences via OpenNLP, you will get some
> >> interesting sentences.
> >>
> >> Trying to parse those sentences makes it even worse.
> >>
> >> I am showing an example below and I would like to ask about solutions to
> >> it, considering the text can be noisy.
> >>
> >> I do not think that it will be easy to fix the Sentence Parser. Here is
> >> how I am thinking of approaching it:
> >>
> >> Instead, the best way is to look at the poorly parsed sentences,
> >> parse them, and extract the inner (S) nodes from the parse as separate
> >> sentences.
> >>
> >> What would you suggest?
> >>
> >> Here is an example of a piece of text extracted with Tika from a public
> >> PDF. This part is what OpenNLP considered to be a sentence:
> >>
> >> ----
> >>
> >> related research DocumentsBrief: Your Next Portal should Be An Engagement
> >> WorkplaceFebruary 3, 2014Microsoft Aims sharePoint To The CloudJanuary 27,
> >> 2014setting The Technology Foundation For Your social Business And
> >> Collaboration strategyJuly 29, 2013The Forrester wave : enterprise social
> >> Platforms, Q2 2014The 13 Providers That Matter Most And How They stack
> >> Upby rob Koplowitzwith Peter Burris and Nancy Wang2257913JUNE 5, 2014For CIos
> >> The Forrester Wave : Enterprise social Platforms, Q2 2014 2  2014,
> >> Forrester Research, Inc. Reproduction Prohibited June 5, 2014 eNTeRPRIse
> >> sOCIaL PLaTFORM MaRKeT MaTuRes aMID CONsOLIDaTION aND INTegRaTIONThe
> >> enterprise social platform is no longer in its infancy as offerings become
> >> increasingly functional.
> >>
> >> ----
> >>
> >> It is now parsed as follows:
> >>
> >>
> >> (S (S (S (NP (VBN related) (NN research) (NNP DocumentsBrief:) (NNP Your)
> >> (NNP Next) (NNP Portal)) (VP (MD should) (VP (VB Be) (NP (NP (DT An)
> >> (NNP Engagement) (NNP WorkplaceFebruary) (CD 3,) (JJ 2014Microsoft)
> >> (NNP Aims) (NN sharePoint)) (PP (TO To) (NP (DT The) (NNP CloudJanuary)))))))
> >> (VP (VBD 27,) (S (VP (VBG 2014setting) (NP (NP (NP (DT The) (NNP Technology)
> >> (NNP Foundation)) (PP (IN For) (NP (PRP$ Your) (JJ social) (NNP Business))))
> >> (CC And) (NP (NNP Collaboration))) (PP (RB strategyJuly) (NP (CD 29,)
> >> (CD 2013The) (NNP Forrester) (NN wave))))))) (: :) (S (VP (VB enterprise)
> >> (NP (JJ social) (NN Platforms,)) (PP (IN Q2) (NP (NP (DT 2014The) (CD 13)
> >> (NNS Providers)) (NP (NP (DT That) (NNP Matter) (JJS Most)) (SBAR (S (CC And)
> >> (SBAR (WHADVP (WRB How)) (S (NP (PRP They)) (VP (VBP stack) (PP (IN Upby)
> >> (NP (NP (NN rob)) (PP (IN Koplowitzwith) (NP (NP (NNP Peter) (NNP Burris))
> >> (CC and) (NP (NP (NNP Nancy) (NNP Wang2257913JUNE) (CD 5,)) (PP (IN 2014For)
> >> (NP (NP (NNP CIos)) (NP (DT The) (NNP Forrester) (NNP Wave)) (: :)
> >> (S (NP (NP (NP (NP (NN Enterprise) (JJ social) (NN Platforms,)) (PP (IN Q2)
> >> (NP (CD 2014) (CD 2) (JJ 2014,) (NNP Forrester) (NNP Research,) (NNP Inc.)
> >> (NNP Reproduction) (NNP Prohibited) (NNP June) (CD 5,) (CD 2014)
> >> (NN eNTeRPRIse) (NN sOCIaL)))) (NP (NNP PLaTFORM) (NNP MaRKeT) (NNP MaTuRes)
> >> (NN aMID) (NN CONsOLIDaTION))) (PP (IN aND) (NP (DT INTegRaTIONThe)
> >> (NN enterprise) (JJ social) (NN platform)))) (VP (VP (VBZ is) (ADVP (RB no)
> >> (RB longer)) (PP (IN in) (NP (PRP$ its) (NN infancy))) (SBAR (IN as)
> >> (S (NP (NNS offerings)) (VP (VBP become) (ADVP (RB increasingly)))))) (VBG
> >> functional.)))))))))))))))))))))
> >>
> >>
> >> Notice that I have more than one (S (S (S
> >>
> >> And then I have the first correct structure as (S (NP ..... (VP.....
> >>
> >> What is the best way to deal with it?
> >>
> >>
> >
>
