If I've understood you correctly, you want to pump text into a lucene Analyzer and grab the output and do something else with that. If that is right, you can use code based on something like this:
for (String s : array-of-input-texts) { Analyzer anl = new xxxAnalyzer(whatever); TokenStream stream = anl.tokenStream("", new StringReader(s)); TermAttribute term = (TermAttribute) stream.addAttribute(TermAttribute.class); while (stream.incrementToken()) { System.out.println(" "+term.term()); } } There is code in the second edition of Lucene In Action called something like AnalyzerUtils that may have inspired the code above. Reading a copy of that book is an excellent idea for any lucene related topic. -- Ian. On Tue, Nov 30, 2010 at 5:09 PM, McGibbney, Lewis John <lewis.mcgibb...@gcu.ac.uk> wrote: > Hello list, > > I am currently attempting to extract keywords from pdf documents, my aim is > then to begin constructing a domain ontology using the words which are > extracted. I do not need to index anything at this stage, but wish to extract > and push the output as plain text into a text file. An example of input text > from the pdf document would be as follows > ________________________________ > 6.1.3 Calculating carbon dioxide emissions for the proposed dwelling > The second calculation involves establishing the carbon dioxide emissions > for the proposed dwelling (DER). To do this the values proposed for the > dwelling should be used in the methodology i.e. the U-values, air > infiltration, > heating system, etc. > The exceptions to entering the dwelling specific values are: > a. it may be assumed that all glazing is orientated east/west; > b. average overshading may be assumed if not known. 'Very little' shading > should not be entered; > c. 2 sheltered sides should be assumed if not known. More than 2 sheltered > sides should not be entered; > d. where secondary heating is proposed, if a chimney or flue is present but > no appliance installed the worst case should be assumed i.e. a decorative > fuel-effect gas appliance with 20% efficiency. If there is no gas point, an > open fire with 37% efficiency should be assumed, burning solid mineral > fuel for dwellings outwith a smokeless zone and smokeless solid mineral > fuel for those that are within such a zone. > All other values can be varied, but before entering values into the > methodology, reference should be made to: > * the back-stop U-values identified in guidance to standard 6.2; and > * guidance on systems and equipment within standards 6.3 to 6.6. > ________________________________ > My requirements are as follows > > > * drop stop words > > * be able to pick up Bi Grams or NGrams such as the following > "U-Values", "super-insulated", "air infiltration" etc, > > * lower case filter > > I have currently been using Lucene 3.0.1 with a custom filter to achieve the > above bullet points, then using Luke to pick up phrases and entities from > text by looking into the generated index, however I found that this was very > time consuming. My intention is to pass the pdf document as input and receive > the above as output which I can then use to manually construct my ontology > from entities and their relationships. > > I previously posted this to the Tika list with no response, so again I > apologise if this is not a problem for the Lucene java list. Can anyone > suggest a possible solution to the problem. > > Any help would be great ;0) Thanks > > Lewis > > > Glasgow Caledonian University is a registered Scottish charity, number > SC021474 > > Winner: Times Higher Education's Widening Participation Initiative of the > Year 2009 and Herald Society's Education Initiative of the Year 2009 > http://www.gcu.ac.uk/newsevents/news/bycategory/theuniversity/1/name,6219,en.html > --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org