[jira] [Commented] (PDFBOX-3804) Detect end of paragraphs

Tilman Hausherr (JIRA) Wed, 31 May 2017 09:40:34 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-3804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16031477#comment-16031477
 ]


Tilman Hausherr commented on PDFBOX-3804:
-----------------------------------------

With default settings, the plain text extraction does not mark paragraphs, but 
the HTML extraction does because it uses different settings. I tried it, and 
the third page shows that PDFBox splits paragraphs too often, for whatever 
reason. Surprisingly, PDFBox 1.8.13 works better.

Current version:
{quote}
every three years—that situation was swiftly inverted. Today, less than two 
percent of all stored information is nondigital.

Given this massive scale, it is tempting to understand big data solely in terms 
of size. But that would be misleading. Big data is also characterized by the 
ability to render into data many aspects of the world that have never been 
quantified before; call it "datafication." For example, location has been 
datafied, first with the invention of longitude and latitude, and more recently 
with gps satellite systems. Words are treated as data when computers mine 
centuries' worth of books. Even friendships and "likes" are datafied, via 
Facebook.

This kind of data is being put to incredible new uses with the as sistance of 
inexpensive computer memory, powerful processors, smart

algorithms, clever software, and math that borrows from basic statis tics. 
Instead of trying to "teach" a computer how to do things, such as

drive a car or translate between languages, which artificial-intelligence

experts have tried unsuccessfully to do for decades, the new approach is to 
feed enough data into a computer so that it can infer the proba bility that, 
say, a traffic light is green and not red or that, in a certain

context, lumière is a more appropriate substitute for "light" than léger.

Using great volumes of information in this way requires three profound changes 
in how we approach data. The first is to collect and use a lot of data rather 
than settle for small amounts or samples, as statisticians have done for well 
over a century. The second is to shed

our preference for highly curated and pristine data and instead accept

messiness: in an increasing number of situations, a bit of inaccuracy can be 
tolerated, because the benefits of using vastly more data of variable quality 
outweigh the costs of using smaller amounts of very exact data. Third, in many 
instances, we will need to give up our quest to discover the cause of things, 
in return for accepting correlations. With big data, instead of trying to 
understand precisely why an engine breaks down or why a drug's side effect 
disappears, researchers can instead collect and analyze massive quantities of 
information about such events and everything that is associated with them, 
looking for

patterns that might help predict future occurrences. Big data helps answer 
what, not why, and often that's good enough.

The Internet has reshaped how humanity communicates. Big data is different: it 
marks a transformation in how society processes

information. In time, big data might change our way of thinking about

the world. As we tap ever more data to understand events and make 
{quote}

Version 1.8.13:
{quote}
every three years—that situation was swiftly inverted. Today, less than two 
percent of all stored information is nondigital.

Given this massive scale, it is tempting to understand big data solely in terms 
of size. But that would be misleading. Big data is also characterized by the 
ability to render into data many aspects of the world that have never been 
quantified before; call it "datafication." For example, location has been 
datafied, first with the invention of longitude and latitude, and more recently 
with gps satellite systems. Words are treated as data when computers mine 
centuries' worth of books. Even friendships and "likes" are datafied, via 
Facebook.

This kind of data is being put to incredible new uses with the as sistance of 
inexpensive computer memory, powerful processors, smart algorithms, clever 
software, and math that borrows from basic statis tics. Instead of trying to 
"teach" a computer how to do things, such as drive a car or translate between 
languages, which artificial-intelligence experts have tried unsuccessfully to 
do for decades, the new approach is to feed enough data into a computer so that 
it can infer the proba bility that, say, a traffic light is green and not red 
or that, in a certain context, lumière is a more appropriate substitute for 
"light" than léger.

Using great volumes of information in this way requires three profound changes 
in how we approach data. The first is to collect and use a lot of data rather 
than settle for small amounts or samples, as statisticians have done for well 
over a century. The second is to shed our preference for highly curated and 
pristine data and instead accept messiness: in an increasing number of 
situations, a bit of inaccuracy can be tolerated, because the benefits of using 
vastly more data of variable quality outweigh the costs of using smaller 
amounts of very exact data. Third, in many instances, we will need to give up 
our quest to discover the cause of things, in return for accepting 
correlations. With big data, instead of trying to understand precisely why an 
engine breaks down or why a drug's side effect disappears, researchers can 
instead collect and analyze massive quantities of information about such events 
and everything that is associated with them, looking for patterns that might 
help predict future occurrences. Big data helps answer what, not why, and often 
that's good enough.

The Internet has reshaped how humanity communicates. Big data is different: it 
marks a transformation in how society processes information. In time, big data 
might change our way of thinking about the world. As we tap ever more data to 
understand events and make 
{quote}

Is the output of 1.8.13 what you'd like? If yes, please test other files with 
it.
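
When testing other files, a rough number can help to spot over-splitting 
without reading every page. The helper below is only a sketch in plain Java, 
not part of PDFBox; the class and method names are made up for illustration. 
It assumes that paragraphs in the extracted text are separated by blank lines, 
as in the two excerpts quoted in this comment.

```java
/** Hypothetical helper, not part of PDFBox: counts blank-line-separated blocks. */
class ParagraphCount {

    /**
     * Counts the blocks of text separated by at least one blank line
     * (a line that is empty or contains only whitespace).
     */
    static int countParagraphs(String text) {
        int count = 0;
        // \R matches any line break; two of them with only whitespace
        // in between form a blank line.
        for (String block : text.split("\\R\\s*\\R")) {
            if (!block.trim().isEmpty()) {
                count++;
            }
        }
        return count;
    }
}
```

Counting the blank-line-separated blocks in the two excerpts above gives 14 for 
the current version and 5 for 1.8.13, so a large difference between versions 
on the same file is a hint that paragraphs are being cut.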

Now the question for us is: WTF happened between 1.8 and 2.0?

If the output isn't what you like, then you should write a patch and submit 
it... none of us has touched the text extraction algorithms for years.

> Detect end of paragraphs
> ------------------------
>
>                 Key: PDFBOX-3804
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3804
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.6, 2.0.7, 3.0.0
>            Reporter: Alexandre
>              Labels: extraction, paragraph
>         Attachments: example.pdf
>
>
> Hi,
> Extracting text by paragraphs is probably the improvement most eagerly 
> awaited by PDFBox users.
> *The current text extraction approach correctly detects the end of lines, but 
> it does not correctly detect the end of paragraphs.*
> *What is a paragraph?* A paragraph is a piece of text that contains one or 
> several sentences. It can start with a tabulation, but this is not mandatory. 
> A paragraph has one or more lines but no carriage return (except the one at 
> the very end). A paragraph can end before the very end of a line, but some 
> paragraphs end exactly at the end of a line. If a paragraph ends at the very 
> end of a line, no further lines containing words follow it.
> *So the last line of a paragraph ends before reaching the very end of the 
> line, unless no further lines containing words follow it.* Do you follow me? 
> +An algorithm could use that pattern to detect paragraphs properly.+
> In my opinion, the algorithm should use the following information:
> (*) the +width of the block+ containing the paragraph;
> (*) the precomputed width of the +first word in the next line+.
> The +width of a block+ refers to the width of the area that contains the line 
> holding the character the algorithm is evaluating at any given step.
> The algorithm runs on every character, and when it reaches the +last character 
> of a line+, it precomputes +the first word of the next line+ to obtain its 
> width.
> If +this word+ fits in the previous line after the +last character+, then the 
> algorithm concludes that the paragraph has ended (*case 1*).
> If there is no +next word+, then this is also the end of the paragraph (*case 
> 2*).
> If there is a tabulation before the +next word+, this also marks the end of 
> the paragraph (*case 3*).
> If the +last character+ is far from the end of the block, we automatically 
> conclude that the paragraph has ended (*case 4 is optional*).
> Cheers,
> A.
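
The four cases in the quoted description can be sketched in plain Java as 
follows. This is only an illustration under stated assumptions, not PDFBox 
code and not a proposed patch: the Line class, the method names, and the 
spaceWidth and farThreshold parameters are hypothetical. Case 3 is read here 
as "a tabulation before the next word ends the paragraph", which the 
description implies but does not state, and case 1 additionally reserves room 
for one space before the word.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of the four cases from the issue description; not PDFBox code. */
class ParagraphEndSketch {

    /** Hypothetical line model: only the measurements the heuristic needs. */
    static final class Line {
        final float endX;           // x coordinate just after the last character
        final float firstWordWidth; // width of the first word on this line
        final boolean indented;     // true if the line starts with a tabulation

        Line(float endX, float firstWordWidth, boolean indented) {
            this.endX = endX;
            this.firstWordWidth = firstWordWidth;
            this.indented = indented;
        }
    }

    /**
     * Decides whether 'current' is the last line of a paragraph.
     * blockRight is the right edge of the text block, spaceWidth the width of
     * one space (an added assumption), and farThreshold the gap used by the
     * optional case 4 (pass Float.POSITIVE_INFINITY to switch it off).
     */
    static boolean endsParagraph(Line current, Line next, float blockRight,
                                 float spaceWidth, float farThreshold) {
        if (next == null) {
            return true; // case 2: there is no next word
        }
        if (next.indented) {
            return true; // case 3: tabulation before the next word (assumed reading)
        }
        float remaining = blockRight - current.endX;
        if (next.firstWordWidth + spaceWidth <= remaining) {
            return true; // case 1: the next word would have fit on this line
        }
        return remaining > farThreshold; // case 4 (optional)
    }

    /** Returns the indices of the lines that end a paragraph. */
    static List<Integer> paragraphEnds(List<Line> lines, float blockRight,
                                       float spaceWidth, float farThreshold) {
        List<Integer> ends = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            Line next = (i + 1 < lines.size()) ? lines.get(i + 1) : null;
            if (endsParagraph(lines.get(i), next, blockRight, spaceWidth, farThreshold)) {
                ends.add(i);
            }
        }
        return ends;
    }
}
```

The sketch works on whole lines rather than on every character, because only 
the end position of a line and the first word of the following line enter the 
decision. In a real text stripper these values would have to come from the 
text positions of the page, which is not shown here.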



