Hello list,

I am currently attempting to extract keywords from PDF documents. My aim is then to begin constructing a domain ontology from the extracted words. I do not need to index anything at this stage; I only wish to extract the keywords and write them out as plain text to a text file.

An example of input text from the PDF document would be as follows:

________________________________

6.1.3 Calculating carbon dioxide emissions for the proposed dwelling

The second calculation involves establishing the carbon dioxide emissions for the proposed dwelling (DER). To do this the values proposed for the dwelling should be used in the methodology i.e. the U-values, air infiltration, heating system, etc. The exceptions to entering the dwelling specific values are:

a. it may be assumed that all glazing is orientated east/west;
b. average overshading may be assumed if not known. 'Very little' shading should not be entered;
c. 2 sheltered sides should be assumed if not known. More than 2 sheltered sides should not be entered;
d. where secondary heating is proposed, if a chimney or flue is present but no appliance installed the worst case should be assumed i.e. a decorative fuel-effect gas appliance with 20% efficiency. If there is no gas point, an open fire with 37% efficiency should be assumed, burning solid mineral fuel for dwellings outwith a smokeless zone and smokeless solid mineral fuel for those that are within such a zone.

All other values can be varied, but before entering values into the methodology, reference should be made to:

* the back-stop U-values identified in guidance to standard 6.2; and
* guidance on systems and equipment within standards 6.3 to 6.6.

________________________________

My requirements are as follows:
* drop stop words;
* be able to pick up bigrams or n-grams such as "U-values", "super-insulated" and "air infiltration";
* apply a lower-case filter.

I have been using Lucene 3.0.1 with a custom filter to achieve the above, then using Luke to pick out phrases and entities from the text by inspecting the generated index. However, I found this very time-consuming. My intention is to pass the PDF document as input and receive the above as output, which I can then use to manually construct my ontology from the entities and their relationships.

I previously posted this to the Tika list with no response, so I apologise if this is not a question for the Lucene Java list. Can anyone suggest a possible solution? Any help would be great ;0)

Thanks
Lewis
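
P.S. To make the three requirements concrete, here is a minimal plain-Java sketch of the kind of output I am after. It is only an illustration under stated assumptions, not a Lucene solution: it assumes the PDF text has already been extracted into a String by some other tool, the class name KeywordSketch and the tiny stop list are made up for the example, and a "bigram" here simply means two neighbouring non-stop words separated only by whitespace.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene or Tika.
final class KeywordSketch {

    // Tiny illustrative stop list (an assumption); substitute a fuller one.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
        "a", "an", "and", "are", "as", "be", "for", "if", "in", "is", "it",
        "not", "of", "or", "should", "that", "the", "to", "used"));

    // A token is a run of letters/digits, optionally joined by single hyphens,
    // so "U-values" and "super-insulated" each survive as one term.
    private static final Pattern TOKEN =
        Pattern.compile("[\\p{L}\\p{N}]+(?:-[\\p{L}\\p{N}]+)*");

    // Returns lower-cased single terms, each followed by the bigram it
    // completes (if any). Stop words and punctuation both break a bigram.
    static List<String> extract(String text) {
        List<String> terms = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        String previous = null;   // last kept token, or null after a stop word
        int previousEnd = 0;      // end offset of the last token seen
        while (m.find()) {
            String token = m.group().toLowerCase(Locale.ROOT);
            // True when only whitespace separates this token from the last one.
            boolean adjacent =
                text.substring(previousEnd, m.start()).trim().isEmpty();
            previousEnd = m.end();
            if (STOP_WORDS.contains(token)) {
                previous = null;
                continue;
            }
            terms.add(token);
            if (previous != null && adjacent) {
                terms.add(previous + " " + token);
            }
            previous = token;
        }
        return terms;
    }

    public static void main(String[] args) {
        String sample = "the U-values, air infiltration, heating system, etc.";
        for (String term : extract(sample)) {
            System.out.println(term);
        }
    }
}
```

Redirecting the program's standard output to a file would give the plain-text term list described above. Longer n-grams would need a sliding window over the kept tokens rather than the single "previous" slot used here.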