Re: Keyword extraction from pdf to text

Ian Lea Tue, 30 Nov 2010 09:45:15 -0800

If I've understood you correctly, you want to pump text into a lucene
Analyzer and grab the output and do something else with that.  If that
is right, you can use code based on something like this:


        for (String s : array-of-input-texts) {
            Analyzer anl = new xxxAnalyzer(whatever);
            TokenStream stream = anl.tokenStream("", new StringReader(s));
            TermAttribute term = (TermAttribute)
stream.addAttribute(TermAttribute.class);
            while (stream.incrementToken()) {
              System.out.println(" "+term.term());
            }
        }

There is code in the second edition of Lucene In Action called
something like AnalyzerUtils that may have inspired the code above.
Reading a copy of that book is an excellent idea for any lucene
related topic.


--
Ian.


On Tue, Nov 30, 2010 at 5:09 PM, McGibbney, Lewis John
<lewis.mcgibb...@gcu.ac.uk> wrote:
> Hello list,
>
> I am currently attempting to extract keywords from pdf documents, my aim is 
> then to begin constructing a domain ontology using the words which are 
> extracted. I do not need to index anything at this stage, but wish to extract 
> and push the output as plain text into a text file. An example of input text 
> from the pdf document would be as follows
> ________________________________
> 6.1.3 Calculating carbon dioxide emissions for the proposed dwelling
> The second calculation involves establishing the carbon dioxide emissions
> for the proposed dwelling (DER). To do this the values proposed for the
> dwelling should be used in the methodology i.e. the U-values, air 
> infiltration,
> heating system, etc.
> The exceptions to entering the dwelling specific values are:
> a. it may be assumed that all glazing is orientated east/west;
> b. average overshading may be assumed if not known. 'Very little' shading
> should not be entered;
> c. 2 sheltered sides should be assumed if not known. More than 2 sheltered
> sides should not be entered;
> d. where secondary heating is proposed, if a chimney or flue is present but
> no appliance installed the worst case should be assumed i.e. a decorative
> fuel-effect gas appliance with 20% efficiency. If there is no gas point, an
> open fire with 37% efficiency should be assumed, burning solid mineral
> fuel for dwellings outwith a smokeless zone and smokeless solid mineral
> fuel for those that are within such a zone.
> All other values can be varied, but before entering values into the
> methodology, reference should be made to:
> * the back-stop U-values identified in guidance to standard 6.2; and
> * guidance on systems and equipment within standards 6.3 to 6.6.
> ________________________________
> My requirements are as follows
>
>
> *         drop stop words
>
> *         be able to pick up Bi Grams or NGrams such as the following 
> "U-Values", "super-insulated", "air infiltration" etc,
>
> *         lower case filter
>
> I have currently been using Lucene 3.0.1 with a custom filter to achieve the 
> above bullet points, then using Luke to pick up phrases and entities from 
> text by looking into the generated index, however I found that this was very 
> time consuming. My intention is to pass the pdf document as input and receive 
> the above as output which I can then use to manually construct my ontology 
> from entities and their relationships.
>
> I previously posted this to the Tika list with no response, so again I 
> apologise if this is not a problem for the Lucene java list. Can anyone 
> suggest a possible solution to the problem.
>
> Any help would be great ;0) Thanks
>
> Lewis
>
>
> Glasgow Caledonian University is a registered Scottish charity, number 
> SC021474
>
> Winner: Times Higher Education's Widening Participation Initiative of the 
> Year 2009 and Herald Society's Education Initiative of the Year 2009
> http://www.gcu.ac.uk/newsevents/news/bycategory/theuniversity/1/name,6219,en.html
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

Re: Keyword extraction from pdf to text

Reply via email to