Keyword extraction from pdf to text

McGibbney, Lewis John Tue, 30 Nov 2010 09:09:37 -0800

Hello list,

I am currently attempting to extract keywords from PDF documents. My aim is 
then to begin constructing a domain ontology using the words which are 
extracted. I do not need to index anything at this stage, but wish to extract 
the keywords and push the output as plain text into a text file. An example of 
input text from the PDF document is as follows:
________________________________
6.1.3 Calculating carbon dioxide emissions for the proposed dwelling
The second calculation involves establishing the carbon dioxide emissions
for the proposed dwelling (DER). To do this the values proposed for the
dwelling should be used in the methodology i.e. the U-values, air infiltration,
heating system, etc.
The exceptions to entering the dwelling specific values are:
a. it may be assumed that all glazing is orientated east/west;
b. average overshading may be assumed if not known. 'Very little' shading
should not be entered;
c. 2 sheltered sides should be assumed if not known. More than 2 sheltered
sides should not be entered;
d. where secondary heating is proposed, if a chimney or flue is present but
no appliance installed the worst case should be assumed i.e. a decorative
fuel-effect gas appliance with 20% efficiency. If there is no gas point, an
open fire with 37% efficiency should be assumed, burning solid mineral
fuel for dwellings outwith a smokeless zone and smokeless solid mineral
fuel for those that are within such a zone.
All other values can be varied, but before entering values into the
methodology, reference should be made to:
* the back-stop U-values identified in guidance to standard 6.2; and
* guidance on systems and equipment within standards 6.3 to 6.6.
________________________________
My requirements are as follows:

* drop stop words
* be able to pick up bigrams or n-grams such as the following: 
  "U-values", "super-insulated", "air infiltration", etc.
* lower case filter
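
As a rough illustration of the three requirements above, the following is a 
minimal sketch in plain Java. It makes no Lucene or Tika calls and assumes the 
PDF text has already been extracted into a string by some other tool. The class 
name KeywordSketch, the short stop-word list, the word pattern and the method 
names are all hypothetical, chosen only for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class KeywordSketch {

    // Tiny illustrative stop-word list (an assumption); a real one would be much longer.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "an", "and", "are", "as", "be", "but", "for", "if", "in", "is", "it",
            "not", "of", "or", "should", "that", "the", "this", "to", "with"));

    // A word is a run of letters/digits, optionally joined by internal hyphens,
    // so "U-values" and "fuel-effect" survive as single tokens.
    static final Pattern WORD = Pattern.compile("[\\p{L}\\p{N}]+(?:-[\\p{L}\\p{N}]+)*");

    /** Lower-cases the text, splits it into words and drops stop words. */
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = WORD.matcher(text.toLowerCase(Locale.ROOT));
        while (m.find()) {
            if (!STOP_WORDS.contains(m.group())) {
                out.add(m.group());
            }
        }
        return out;
    }

    /** Joins each pair of adjacent surviving tokens into a candidate bigram. */
    static List<String> bigrams(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            out.add(tokens.get(i) + " " + tokens.get(i + 1));
        }
        return out;
    }

    public static void main(String[] args) {
        String sample = "To do this the values proposed for the dwelling should be used "
                + "in the methodology i.e. the U-values, air infiltration, heating system, etc.";
        List<String> words = tokens(sample);
        System.out.println(words);          // single keywords, one candidate per token
        System.out.println(bigrams(words)); // candidate two-word phrases
    }
}
```

Note that in this sketch bigrams are formed after stop words and punctuation 
have been removed, so some pairs will bridge a gap in the original sentence 
and would still need to be filtered by hand or by frequency.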

I have been using Lucene 3.0.1 with a custom filter to achieve the above 
bullet points, then using Luke to pick up phrases and entities from the text 
by looking into the generated index. However, I found this very time 
consuming. My intention is to pass the PDF document as input and receive the 
above as output, which I can then use to manually construct my ontology from 
entities and their relationships.

I previously posted this to the Tika list with no response, so again I 
apologise if this is not a problem for the Lucene java list. Can anyone suggest 
a possible solution to the problem?

Any help would be great ;0) Thanks

Lewis


