[
https://issues.apache.org/jira/browse/PDFBOX-3804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16031477#comment-16031477
]
Tilman Hausherr commented on PDFBOX-3804:
-----------------------------------------
With default settings, the plain text extraction does not show paragraphs, but
the HTML extraction does because it uses different settings. I tried it, and
the third page shows that PDFBox cuts too much, for whatever reason.
Surprisingly, PDFBox 1.8.13 works better.
Current version:
{quote}
every three years—that situation was swiftly inverted. Today, less than two
percent of all stored information is nondigital.
Given this massive scale, it is tempting to understand big data solely in terms
of size. But that would be misleading. Big data is also characterized by the
ability to render into data many aspects of the world that have never been
quantified before; call it "datafication." For example, location has been
datafied, first with the invention of longitude and latitude, and more recently
with gps satellite systems. Words are treated as data when computers mine
centuries' worth of books. Even friendships and "likes" are datafied, via
Facebook.
This kind of data is being put to incredible new uses with the as sistance of
inexpensive computer memory, powerful processors, smart
algorithms, clever software, and math that borrows from basic statis tics.
Instead of trying to "teach" a computer how to do things, such as
drive a car or translate between languages, which artificial-intelligence
experts have tried unsuccessfully to do for decades, the new approach is to
feed enough data into a computer so that it can infer the proba bility that,
say, a traffic light is green and not red or that, in a certain
context, lumière is a more appropriate substitute for "light" than léger.
Using great volumes of information in this way requires three profound changes
in how we approach data. The first is to collect and use a lot of data rather
than settle for small amounts or samples, as statisticians have done for well
over a century. The second is to shed
our preference for highly curated and pristine data and instead accept
messiness: in an increasing number of situations, a bit of inaccuracy can be
tolerated, because the benefits of using vastly more data of variable quality
outweigh the costs of using smaller amounts of very exact data. Third, in many
instances, we will need to give up our quest to discover the cause of things,
in return for accepting correlations. With big data, instead of trying to
understand precisely why an engine breaks down or why a drug's side effect
disappears, researchers can instead collect and analyze massive quantities of
information about such events and everything that is associated with them,
looking for
patterns that might help predict future occurrences. Big data helps answer
what, not why, and often that's good enough.
The Internet has reshaped how humanity communicates. Big data is different: it
marks a transformation in how society processes
information. In time, big data might change our way of thinking about
the world. As we tap ever more data to understand events and make
{quote}
Version 1.8.13:
{quote}
every three years—that situation was swiftly inverted. Today, less than two
percent of all stored information is nondigital.
Given this massive scale, it is tempting to understand big data solely in terms
of size. But that would be misleading. Big data is also characterized by the
ability to render into data many aspects of the world that have never been
quantified before; call it "datafication." For example, location has been
datafied, first with the invention of longitude and latitude, and more recently
with gps satellite systems. Words are treated as data when computers mine
centuries' worth of books. Even friendships and "likes" are datafied, via
Facebook.
This kind of data is being put to incredible new uses with the as sistance of
inexpensive computer memory, powerful processors, smart algorithms, clever
software, and math that borrows from basic statis tics. Instead of trying to
"teach" a computer how to do things, such as drive a car or translate between
languages, which artificial-intelligence experts have tried unsuccessfully to
do for decades, the new approach is to feed enough data into a computer so that
it can infer the proba bility that, say, a traffic light is green and not red
or that, in a certain context, lumière is a more appropriate substitute for
"light" than léger.
Using great volumes of information in this way requires three profound changes
in how we approach data. The first is to collect and use a lot of data rather
than settle for small amounts or samples, as statisticians have done for well
over a century. The second is to shed our preference for highly curated and
pristine data and instead accept messiness: in an increasing number of
situations, a bit of inaccuracy can be tolerated, because the benefits of using
vastly more data of variable quality outweigh the costs of using smaller
amounts of very exact data. Third, in many instances, we will need to give up
our quest to discover the cause of things, in return for accepting
correlations. With big data, instead of trying to understand precisely why an
engine breaks down or why a drug's side effect disappears, researchers can
instead collect and analyze massive quantities of information about such events
and everything that is associated with them, looking for patterns that might
help predict future occurrences. Big data helps answer what, not why, and often
that's good enough.
The Internet has reshaped how humanity communicates. Big data is different: it
marks a transformation in how society processes information. In time, big data
might change our way of thinking about the world. As we tap ever more data to
understand events and make
{quote}
Is the output of 1.8.13 what you'd like? If yes, please test other files with
it.
Now the question for us is: WTF happened between 1.8 and 2.0?
If the output isn't what you like, then you should write a patch and submit
it... none of us has touched the text extraction algorithms for years.
> Detect end of paragraphs
> ------------------------
>
> Key: PDFBOX-3804
> URL: https://issues.apache.org/jira/browse/PDFBOX-3804
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.6, 2.0.7, 3.0.0
> Reporter: Alexandre
> Labels: extraction, paragraph
> Attachments: example.pdf
>
>
> Hi,
> Extracting text by paragraphs is probably the improvement most eagerly
> awaited by PDFBox users.
> *The current text extraction approach correctly detects the end of lines, but
> it does not correctly detect the end of paragraphs.*
> *What is a paragraph?* A paragraph is a text that contains one or several
> sentences. It can start with a tabulation, but this is not mandatory. A
> paragraph has one or more lines but no carriage return (except the one at
> the very end). A paragraph can end before the very end of a line, but some
> paragraphs end at the very end. If a paragraph ends at the very end of a
> line, there are no further lines containing words after it.
> *So, the last line of a paragraph ends before reaching the very end of the
> line, unless there are no further lines containing words after it.* Do you
> follow me? +An algorithm could use that pattern to detect paragraphs
> properly.+
> In my opinion, the algorithm should use the following information:
> (*) the +width of the block+ containing the paragraph;
> (*) the precomputed width of the +first word in the next line+.
> The +width of a block+ refers to the width of the area that contains the line
> holding the character the algorithm is evaluating at any given step.
> The algorithm runs on every character, and when it reaches the +last character
> of a line+, it precomputes +the first word of the next line+ to obtain its
> width.
> If +this word+ fits in the previous line after the +last character+, then the
> algorithm concludes that the paragraph has ended (*case 1*).
> If there is no +next word+, then this is also the end of the paragraph (*case
> 2*).
> If there is a tabulation before the +next word+, this is also the end of the
> paragraph (*case 3*).
> If the +last character+ is far from the end of the block, we automatically
> conclude that the paragraph has ended (*case 4 is optional*).
> Cheers,
> A.
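
The cases in the quoted description could be sketched in Java roughly as
follows. This is only an illustration under assumptions, not PDFBox code: the
class ParagraphSketch, the Line type, and the spaceWidth and indentThreshold
parameters are hypothetical names, and the position of each line and the width
of its first word are assumed to be known already. The optional case 4 is left
out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the proposed heuristic; none of these names exist in PDFBox.
class ParagraphSketch {

    /** One extracted line: its text, horizontal extent, and the width of its first word. */
    static final class Line {
        final String text;
        final double left;            // x of the first character
        final double right;           // x just after the last character
        final double firstWordWidth;  // assumed to be precomputed by the caller

        Line(String text, double left, double right, double firstWordWidth) {
            this.text = text;
            this.left = left;
            this.right = right;
            this.firstWordWidth = firstWordWidth;
        }
    }

    /** Returns true if 'current' is the last line of a paragraph (cases 1 to 3). */
    static boolean endsParagraph(Line current, Line next, double blockLeft,
                                 double blockRight, double spaceWidth,
                                 double indentThreshold) {
        if (next == null) {
            return true; // case 2: there is no next word
        }
        if (next.left - blockLeft > indentThreshold) {
            return true; // case 3: the next line starts with a tabulation
        }
        // case 1: the first word of the next line would have fit after the last character
        double remaining = blockRight - current.right;
        return remaining >= spaceWidth + next.firstWordWidth;
    }

    /** Joins the lines of each detected paragraph with single spaces. */
    static List<String> splitParagraphs(List<Line> lines, double blockLeft,
                                        double blockRight, double spaceWidth,
                                        double indentThreshold) {
        List<String> paragraphs = new ArrayList<>();
        StringBuilder paragraph = new StringBuilder();
        for (int i = 0; i < lines.size(); i++) {
            Line current = lines.get(i);
            Line next = i + 1 < lines.size() ? lines.get(i + 1) : null;
            if (paragraph.length() > 0) {
                paragraph.append(' ');
            }
            paragraph.append(current.text);
            if (endsParagraph(current, next, blockLeft, blockRight,
                              spaceWidth, indentThreshold)) {
                paragraphs.add(paragraph.toString());
                paragraph.setLength(0);
            }
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        // Made-up coordinates for a block spanning x = 0 to x = 100.
        List<Line> lines = Arrays.asList(
                new Line("Big data is also", 0, 98, 15),
                new Line("characterized.", 0, 60, 60),
                new Line("This kind of data", 0, 97, 20),
                new Line("is new.", 0, 30, 8));
        for (String p : splitParagraphs(lines, 0, 100, 3, 5)) {
            System.out.println("[" + p + "]");
        }
    }
}
```

The widths here are plain numbers supplied by the caller. In a real text
stripper they would have to come from the glyph positions, and the thresholds
would probably need tuning per document.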