Re: How to create a new index

KK Wed, 20 May 2009 06:30:02 -0700

Thank you ag...@john.
This is even better. I don't have to bother with the 3rd argument, right?
I'll use the same constructor every time, both for registering a new core
and for adding docs to an existing one.


Thanks,
KK.

On Wed, May 20, 2009 at 6:54 PM, John Byrne <john.by...@propylon.com> wrote:

> Hi KK,
>
> You're welcome!
>
> BTW, I had a quick look at the Javadoc for IndexWriter and noticed this
> constructor:
>
> public IndexWriter(Directory d, Analyzer a)
> "Constructs an IndexWriter for the index in d, first creating it if it does
> not already exist."
>
> I think that might solve your problem and simplify the code a little - I
> think you could just use that constructor every time, because it will only
> create the index if it does not already exist.
>
> -John
>
>
> KK wrote:
>
>> Thanks a lot @John. That solved the problem, and the other advice is really
>> helpful. I'd have stumbled over that otherwise.
>> This clears up my doubt: every time I have to create a new index, I just
>> call the IndexWriter once with "true", thereby creating the directory, then
>> start adding docs with "false" as the 3rd argument instead of "true", right?
>> Lucene is pretty simple and gives you full control of whatever you are
>> doing. I've been trying to automate the creation of new Solr cores for the
>> last two days without any luck. Finally today I moved to Lucene and it fixed
>> my problem very quickly. Thank you all, and special thanks to the Lucene guys.
>>
>> Thanks,
>> KK.
>>
>> On Wed, May 20, 2009 at 6:28 PM, John Byrne <john.by...@propylon.com>
>> wrote:
>>
>>
>>
>>> I think the problem is that you are creating a new index every time you
>>> add a document:
>>>
>>> IndexWriter writer = new IndexWriter(trueIndexPath, new StandardAnalyzer(), true);
>>>
>>> The last argument, the boolean 'true', tells IndexWriter to overwrite any
>>> existing index in that directory. If you set that to false, it will not
>>> overwrite the previous index, but will add to it.
>>>
>>> How, then, do you create it in the first place? You call the IndexWriter
>>> constructor once with 'true' as the 3rd argument, creating the index, then
>>> subsequently use 'false'. You could do this in your main method, right
>>> after you create an instance of SimpleIndexer, but before you call
>>> createIndex.
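The "true once, then false" rule described above comes down to choosing the boolean before each writer is opened. Here is a minimal sketch of one way to choose it, assuming that a missing or empty directory means no index has been written yet. That assumption is only a heuristic for illustration, and `CreateFlag`, `shouldCreate` and the `placeholder.dat` file are hypothetical names, not part of Lucene.

```java
import java.io.File;
import java.io.IOException;

class CreateFlag {

    // Returns true when indexDir is missing or empty, i.e. when this sketch
    // ASSUMES that no index has been written there yet. This is a heuristic
    // for illustration only, not Lucene's own existence check.
    static boolean shouldCreate(File indexDir) {
        String[] entries = indexDir.list(); // null if the path is not a directory
        return entries == null || entries.length == 0;
    }

    public static void main(String[] args) throws IOException {
        File dir = new File(System.getProperty("java.io.tmpdir"),
                "create-flag-demo-" + System.nanoTime());

        // Nothing on disk yet, so the first open should create the index.
        System.out.println("first open, create = " + shouldCreate(dir));

        // Simulate an index having been written by dropping a file into the
        // directory (placeholder.dat is a made-up name for this demo).
        dir.mkdirs();
        new File(dir, "placeholder.dat").createNewFile();

        // The directory now has content, so later opens should append.
        System.out.println("later open, create = " + shouldCreate(dir));
    }
}
```

The value returned by `shouldCreate` would be passed as the 3rd constructor argument in place of the hard-coded 'true'.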
>>>
>>> -John
>>>
>>>
>>>
>>> KK wrote:
>>>
>>>
>>>
>>>> Thank you very much.
>>>> I'm using the one mentioned by @Anshum, but the problem is that after
>>>> indexing a number of docs, all I see is the last one indexed, which
>>>> clearly indicates that the index is getting overwritten. I'm posting my
>>>> simple indexer and searcher herewith. I'm trying to crawl web pages and
>>>> add each page's content under a field called "content", against a field
>>>> called "id", for which I'm using the page URL. Here is the code.
>>>>
>>>> The indexer:
>>>> --------------------------------------------
>>>> package solrSearch;
>>>>
>>>> import org.apache.lucene.analysis.SimpleAnalyzer;
>>>> import org.apache.lucene.analysis.standard.StandardAnalyzer;
>>>> import org.apache.lucene.document.Document;
>>>> import org.apache.lucene.document.Field;
>>>> import org.apache.lucene.index.IndexWriter;
>>>>
>>>> public class SimpleIndexer {
>>>>
>>>>  // Base Path to the index directory
>>>>  private static final String baseIndexPath = "/opt/lucene/index/";
>>>>
>>>>
>>>>  public void createIndex(String pageContent, String pageId, String coreId) throws Exception {
>>>>   String trueIndexPath = baseIndexPath + coreId ;
>>>>   String contentField = "content";
>>>>   String contentId    = "id";
>>>>
>>>>   // Create a writer
>>>>   IndexWriter writer = new IndexWriter(trueIndexPath, new StandardAnalyzer(), true);
>>>>
>>>>   System.out.println("Adding page to lucene " + pageId);
>>>>   Document doc = new Document();
>>>>   doc.add(new Field(contentField, pageContent, Field.Store.YES, Field.Index.TOKENIZED));
>>>>   doc.add(new Field(contentId, pageId, Field.Store.YES, Field.Index.TOKENIZED));
>>>>
>>>>   // Add documents to the index
>>>>   writer.addDocument(doc);
>>>>
>>>>   // Lucene recommends calling optimize upon completion of indexing
>>>>   writer.optimize();
>>>>
>>>>   // clean up
>>>>   writer.close();
>>>>  }
>>>>
>>>>  public static void main(String args[]) throws Exception{
>>>>      SimpleIndexer empIndex = new SimpleIndexer();
>>>>   empIndex.createIndex("this is sample test content", "test0", "core0");
>>>>   System.out.println("Data indexed by lucene");
>>>>  }
>>>>
>>>> }
>>>>
>>>> and the searcher:
>>>> ---------------------------------------
>>>> package solrSearch;
>>>>
>>>> import java.io.FileReader;
>>>> import java.io.IOException;
>>>> import java.io.InputStreamReader;
>>>> import java.util.Date;
>>>>
>>>> import org.apache.lucene.analysis.Analyzer;
>>>> import org.apache.lucene.analysis.standard.StandardAnalyzer;
>>>> import org.apache.lucene.document.Document;
>>>> import org.apache.lucene.index.FilterIndexReader;
>>>> import org.apache.lucene.index.IndexReader;
>>>> import org.apache.lucene.queryParser.QueryParser;
>>>> import org.apache.lucene.search.HitCollector;
>>>> import org.apache.lucene.search.Hits;
>>>> import org.apache.lucene.search.IndexSearcher;
>>>> import org.apache.lucene.search.Query;
>>>> import org.apache.lucene.search.ScoreDoc;
>>>> import org.apache.lucene.search.Searcher;
>>>> import org.apache.lucene.search.TopDocCollector;
>>>>
>>>> /** Simple command-line based search demo. */
>>>> public class SimpleSearcher {
>>>>   private static final String baseIndexPath = "/opt/lucene/index/" ;
>>>>
>>>>   private void searchIndex(String queryString, String coreId) throws Exception{
>>>>       String trueIndexPath = baseIndexPath + coreId;
>>>>       String searchField = "content";
>>>>       IndexSearcher searcher = new IndexSearcher(trueIndexPath);
>>>>       QueryParser queryParser = null;
>>>>       try {
>>>>           queryParser = new QueryParser(searchField, new StandardAnalyzer());
>>>>       } catch (Exception ex) {
>>>>            ex.printStackTrace();
>>>>       }
>>>>
>>>>       Query query = queryParser.parse(queryString);
>>>>
>>>>       Hits hits = null;
>>>>       try {
>>>>            hits = searcher.search(query);
>>>>       } catch (Exception ex) {
>>>>            ex.printStackTrace();
>>>>       }
>>>>
>>>>       int hitCount = hits.length();
>>>>       System.out.println("Results found :" + hitCount);
>>>>
>>>>       for (int ix=0; (ix<hitCount && ix<10); ix++) {
>>>>            Document doc = hits.doc(ix);
>>>>           System.out.println(doc.get("id"));
>>>>           System.out.println(doc.get("content"));
>>>>       }
>>>>   }
>>>>
>>>>   public static void main(String args[]) throws Exception{
>>>>        SimpleSearcher searcher = new SimpleSearcher();
>>>>       String queryString = args[0];
>>>>       System.out.println("Querying for :" + queryString);
>>>>       searcher.searchIndex(queryString, "core0");
>>>>   }
>>>>
>>>> }
>>>>
>>>> ---------------
>>>> When I initially tried this without the core0 directory in place, it was
>>>> created automatically. That's fine, but I can't figure out why the data
>>>> is getting overwritten. There must be a silly mistake somewhere. Can
>>>> someone point it out?
>>>> And this is the code snippet that I'm using to post to the Lucene index.
>>>>
>>>> public void postToSolr(String rawText, String pageId) throws Exception{
>>>>       // Which solr core are we posting to???
>>>>       //String solrCoreId = getCoreId(pageId);
>>>>       String coreId = "core0";
>>>>       SimpleIndexer indexer = new SimpleIndexer();
>>>>       indexer.createIndex(rawText, pageId, coreId);
>>>>
>>>>   }
>>>>
>>>> NB: I didn't take care to change the names, so you might find the word
>>>> "solr" here and there. I was using Solr earlier, but because of the lack
>>>> of a facility for creating new separate indexes, I moved to Lucene just
>>>> today. I guess trying to create a new index in a non-existent directory
>>>> will automatically create it, which is what I want. Correct me if I'm
>>>> wrong. As I mentioned earlier, for each domain [say www.bcd.co.uk] I want
>>>> to have a separate index, and coreId is a map of this URL to a unique
>>>> number. Do let me know if I'm going wrong anywhere or if you feel it can
>>>> be done in a better way.
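The per-domain mapping described above (one index per domain, with coreId derived from the page URL) could be sketched roughly as follows. This is only an illustration under stated assumptions: `CoreRegistry`, `coreIdFor` and `indexPathFor` are hypothetical names, the "core<N>" numbering simply follows the "core0" used in the code above, and `example.org` is a made-up second domain.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

class CoreRegistry {
    private final String baseIndexPath;
    // host name -> core id, e.g. "www.bcd.co.uk" -> "core0"
    private final Map<String, String> coreIds = new ConcurrentHashMap<String, String>();
    private final AtomicInteger next = new AtomicInteger(0);

    CoreRegistry(String baseIndexPath) {
        this.baseIndexPath = baseIndexPath;
    }

    // Returns the core id for the page's host, assigning "core<N>" the first
    // time a host is seen. Host names are compared case-insensitively.
    String coreIdFor(String pageUrl) {
        String host = URI.create(pageUrl).getHost();
        if (host == null) {
            throw new IllegalArgumentException("No host in URL: " + pageUrl);
        }
        return coreIds.computeIfAbsent(host.toLowerCase(Locale.ROOT),
                h -> "core" + next.getAndIncrement());
    }

    // Returns the index directory path for the page's host.
    String indexPathFor(String pageUrl) {
        return baseIndexPath + coreIdFor(pageUrl);
    }

    public static void main(String[] args) {
        CoreRegistry registry = new CoreRegistry("/opt/lucene/index/");
        // Two pages on the same domain share one index path.
        System.out.println(registry.indexPathFor("http://www.bcd.co.uk/a.html"));
        System.out.println(registry.indexPathFor("http://www.bcd.co.uk/b.html"));
        // A different (made-up) domain gets the next core id.
        System.out.println(registry.indexPathFor("http://example.org/"));
    }
}
```

Note that this map lives in memory only; a crawler that is restarted would need to persist it somewhere so that each domain keeps the same index directory.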
>>>>
>>>>
>>>> Thanks,
>>>> KK.
>>>>
>>>>
>>>> On Wed, May 20, 2009 at 4:10 PM, Anshum <ansh...@gmail.com> wrote:
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>> Hi KK,
>>>>>
>>>>> Easier still, you could just open the IndexWriter with the last (3rd)
>>>>> argument as true; this way the IndexWriter would create a new index as
>>>>> soon as you start indexing. Also, if you leave out the 3rd argument, the
>>>>> IndexWriter creates the index conditionally: only if no index exists at
>>>>> that location would it create a new one; otherwise it would append to
>>>>> the existing index at that location.
>>>>> Coming to the 2nd point, if you are talking about the index name, as
>>>>> mentioned by John you could simply use the timestamp as the index name.
>>>>>
>>>>> --
>>>>> Anshum Gupta
>>>>> Naukri Labs!
>>>>> http://ai-cafe.blogspot.com
>>>>>
>>>>> The facts expressed here belong to everybody, the opinions to me. The
>>>>> distinction is yours to draw............
>>>>>
>>>>>
>>>>> On Wed, May 20, 2009 at 3:23 PM, John Byrne <john.by...@propylon.com>
>>>>> wrote:
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> You can do this with pure Java. Create a File object with the path you
>>>>>> want, check if it exists, and if not, create it:
>>>>>>
>>>>>> File newIndexDir = new File("/foo/bar");
>>>>>>
>>>>>> if (!newIndexDir.exists()) {
>>>>>>   newIndexDir.mkdirs();
>>>>>> }
>>>>>>
>>>>>> The 'mkdirs()' method creates any necessary parent directories.
>>>>>>
>>>>>> If you want to automate the generation of the path itself, then there are
>>>>>> several ways to do it, but the best way really depends on *why* you're
>>>>>> generating a new index. For instance, you could just create a timestamped
>>>>>> name, but that name might not be very meaningful.
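As a rough illustration of the timestamped-name idea combined with the directory check, here is a small pure-Java sketch. The class and method names (`IndexDirs`, `timestampedName`, `ensureDir`), the "index-" prefix, the date pattern and the `lucene-demo` base directory are all assumptions made for this example.

```java
import java.io.File;
import java.text.SimpleDateFormat;
import java.util.Date;

class IndexDirs {

    // Builds a directory name of the form "index-yyyyMMdd-HHmmss" from the
    // given time, using the JVM's default time zone.
    static String timestampedName(Date when) {
        return "index-" + new SimpleDateFormat("yyyyMMdd-HHmmss").format(when);
    }

    // Creates baseDir/name, including any missing parent directories, if it
    // is not already there, and returns it.
    static File ensureDir(File baseDir, String name) {
        File dir = new File(baseDir, name);
        if (!dir.exists() && !dir.mkdirs()) {
            throw new IllegalStateException("Could not create " + dir);
        }
        return dir;
    }

    public static void main(String[] args) {
        // Demo location under the system temp directory (an assumption).
        File base = new File(System.getProperty("java.io.tmpdir"), "lucene-demo");
        File dir = ensureDir(base, timestampedName(new Date()));
        System.out.println("Index directory: " + dir.getAbsolutePath());
    }
}
```

Two indexes requested within the same second would get the same name here, which is one more reason a meaningful name may be the better choice.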
>>>>>>
>>>>>> Hope that helps!
>>>>>>
>>>>>> -John
>>>>>>
>>>>>> KK wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>> How do I create a new index? Every time I need to do so, I have to
>>>>>>> create a new directory and put the path to that, right? How do I
>>>>>>> automate the creation of the new directory?
>>>>>>> I'm a new user of Lucene. Please help me out.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> KK.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>  ------------------------------------------------------------------------
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> No virus found in this incoming message.
>>>>>>
>>>>>>
>>>>>>> Checked by AVG - www.avg.com Version: 8.5.339 / Virus Database:
>>>>>>> 270.12.35/2123 - Release Date: 05/19/09 17:59:00
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
