Re: [Dspace-tech] DSpace 1.4.2 search - how the indices are built and what all the files represent?

Allen Lam Thu, 08 Jan 2009 19:28:49 -0800

Hi, Susan,

First of all, for normal use it is not necessary to run index-all from a 
cron job.


Re-indexing is necessary if you have changed the database directly 
(e.g. by running SQL commands to delete or insert data) or when you 
change the indexing options in dspace.cfg.

Normal adding/deleting/editing of items through DSpace's web interface, 
API or shell scripts does not require re-indexing afterward, because the 
index is updated incrementally as part of each operation.

The purpose of index-all is to build an index of the data to speed up 
searching. DSpace search is built on Lucene, which stores the data as an 
inverted index: each term points to the items that contain it. The files 
you see under /dspace/search are the pieces of that index. They are in 
binary format, containing pointers to data and fragments of data.

Indexing is a strategy that trades space for time. You use up some hard 
disk space to store the index files so that searching is fast. If you 
are really interested in the theory of indexing, read up on relational 
database design and indexing.
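
As a rough illustration of the space-for-time trade, here is a toy inverted index in Python. It is only a sketch of the general idea, not the code DSpace or Lucene actually runs; the class, the function names and the sample item ids are all made up for the example. It also shows why an ordinary add or delete only needs a small incremental update rather than a full rebuild.

```python
from collections import defaultdict


class ToyIndex:
    """Toy inverted index: maps each word to the ids of the items containing it."""

    def __init__(self):
        self.postings = defaultdict(set)  # word -> item ids (the extra space spent)
        self.items = {}                   # item id -> original text

    def add(self, item_id, text):
        # Incremental update: only the words of this one item are touched.
        self.remove(item_id)
        self.items[item_id] = text
        for word in text.lower().split():
            self.postings[word].add(item_id)

    def remove(self, item_id):
        text = self.items.pop(item_id, None)
        if text is None:
            return
        for word in text.lower().split():
            self.postings[word].discard(item_id)
            if not self.postings[word]:
                del self.postings[word]

    def search(self, word):
        # One dictionary lookup instead of reading every item.
        return set(self.postings.get(word.lower(), set()))


def scan(items, word):
    """Search without an index: every item's text is read on every query."""
    return {i for i, text in items.items() if word.lower() in text.lower().split()}


index = ToyIndex()
index.add("123/1", "Wind tunnel test report")
index.add("123/2", "Flight test data")
index.add("123/3", "Annual report")

print(sorted(index.search("test")))
index.remove("123/3")  # deleting one item does not rebuild the rest
print(sorted(index.search("report")))
```

The `scan` function gives the same answers as `search`, but its cost grows with the total amount of text, while the indexed lookup does not.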

When you do decide to re-index with index-all, the index is rebuilt from scratch.

In dspace.cfg, you can change the indexing behavior by adding, removing 
or reordering the lines of the form
search.index.N = field:schema.element.qualifier

Remove the indexes that your system does not need. Put the important 
indexes first.
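
To check which metadata ends up in which search field before trimming the list, the search.index lines can be read with a few lines of Python. This is a minimal sketch, not part of DSpace: the function names are hypothetical, the sample lines are taken from the configuration quoted below, and the wildcard handling (a trailing `.*` matching any qualifier, including none) is an assumption about how DSpace interprets it.

```python
import re

# Sample lines in the dspace.cfg format (taken from this thread).
SAMPLE_CFG = """
search.index.1 = author:dc.contributor.*
search.index.2 = author:dc.creator.*
search.index.3 = title:dc.title.*
# search.index.4 = keyword:dc.subject.*
search.index.5 = abstract:dc.description.abstract
"""

LINE_RE = re.compile(r"^search\.index\.(\d+)\s*=\s*(\w+):(\S+)$")


def parse_search_indexes(cfg_text):
    """Return (number, search field, metadata pattern) for each active line."""
    entries = []
    for raw in cfg_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank or commented out
        match = LINE_RE.match(line)
        if match:
            number, field, pattern = match.groups()
            entries.append((int(number), field, pattern))
    return sorted(entries)


def fields_for(metadata, entries):
    """List the search fields that a name such as 'dc.title.alternative' feeds."""
    fields = []
    for _, field, pattern in entries:
        if pattern.endswith(".*"):
            prefix = pattern[:-2]
            # Assumption: '*' matches any qualifier, or no qualifier at all.
            hit = metadata == prefix or metadata.startswith(prefix + ".")
        else:
            hit = metadata == pattern
        if hit and field not in fields:
            fields.append(field)
    return fields


entries = parse_search_indexes(SAMPLE_CFG)
for number, field, pattern in entries:
    print(number, field, pattern)
print(fields_for("dc.contributor.author", entries))
```

Any metadata name for which `fields_for` returns an empty list is not covered by an active search.index line.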

Having a script run for over two days is really annoying. You may 
wish to check for a bottleneck in your hardware or network. Is your 
CPU or network overloaded? Is there enough memory? Is the database too 
busy?

In my case, I have about 30,000 items, and index-all takes 1-2 hours to finish.
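
To put the two run times side by side, here is a back-of-the-envelope calculation that uses only the numbers quoted in this thread (about 30,000 items in 1-2 hours here, about 130,000 items in roughly 2.5 days at your site). It assumes that indexing time grows linearly with the number of items, which is a simplification: item size and the amount of full text differ between repositories.

```python
def items_per_second(items, hours):
    """Average indexing throughput for a full index-all run."""
    return items / (hours * 3600)


# Figures quoted in this thread.
hku_fast = items_per_second(30_000, 1)         # best case here
hku_slow = items_per_second(30_000, 2)         # worst case here
larc = items_per_second(130_000, 2.5 * 24)     # about 2.5 days

print(f"HKU:  {hku_slow:.2f} to {hku_fast:.2f} items/s")
print(f"LaRC: {larc:.2f} items/s")
print(f"Per-item slowdown: {hku_slow / larc:.1f}x to {hku_fast / larc:.1f}x")

# Run time for 130,000 items at the HKU rates, under the linear assumption.
low_hours = 130_000 / hku_fast / 3600
high_hours = 130_000 / hku_slow / 3600
print(f"Expected at HKU rates: {low_hours:.1f} to {high_hours:.1f} hours")
```

If the linear assumption is anywhere near right, the gap is much larger than the difference in repository size alone would explain, which is why the hardware, network and database are worth checking.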

Regards,
Allen Lam.
HKU Hub Administrator, http://hub.hku.hk


Thornton, Susan M. (LARC-B702)[NCI INFORMATION SYSTEMS] wrote:
> Hi,
> 
>      Could one of the developers, or someone else who knows the answer, 
> give me a short explanation of what happens in index-all when the search 
> indices are built?  What exactly is happening?  Where is the data that 
> ends up in the indices extracted from (the .txt files, I'm assuming)?  
> How do the configuration parameters in dspace.cfg affect the building 
> of the indices?  What are the best parameters to use with 
> org.dspace.search.DSIndexer so that the least amount of work has to be 
> done to update the indices?  It appears that we may not be using the 
> correct/best parameters with this program in index-all, since it looks 
> like all the files under /dspace/search are being deleted and completely 
> recreated from scratch each time we run the job.  As a result, index-all 
> is taking 2 ½ days to run (and our repository is getting bigger and 
> bigger every day).
> 
>  
> 
>      Another couple of questions…what determines how many files get 
> built in /dspace/search and what, if anything, should I be able to tell 
> about my search configuration (in dspace.cfg) by looking at the files 
> under /dspace/search?
> 
>  
> 
>      We are having some problems with our online search in DSpace 
> 1.4.2.  For one thing, as mentioned above, our index-all cron is taking 
> over 2 days to run to completion.  Is this normal for a repository that 
> currently has approximately 130,000 items in it?  By the way, here are 
> our configuration parameters for searching DSpace, from dspace.cfg:
> 
>  
> 
> ##### Search settings #####
> 
> # Where to put search index files
> search.dir = ${dspace.dir}/search
> 
> # Higher values of search.max-clauses will enable prefix searches to work on
> # large repositories
> # search.max-clauses = 2048
> search.max-clauses = 102400
> 
> # Which Lucene Analyzer implementation to use.  If this is omitted or
> # commented out, the standard DSpace analyzer (designed for English)
> # is used by default.
> # search.analyzer = org.dspace.search.DSAnalyzer
> 
> # Chinese analyzer
> # search.analyzer = org.apache.lucene.analysis.cn.ChineseAnalyzer
> 
> # Boolean search operator to use, current supported values are OR and AND
> # If this config item is missing or commented out, OR is used
> # AND requires all search terms to be present
> # OR requires one or more search terms to be present
> search.operator = AND
> 
> ##### Fulltext Indexing settings #####
> # Maximum number of terms indexed for a single field in Lucene.
> # Default is 10,000 words - often not enough for full-text indexing.
> # If you change this, you'll need to re-index for the change
> # to take effect on previously added items.
> # -1 = unlimited (Integer.MAX_VALUE)
> search.maxfieldlength = -1
> 
> ##### Fields to Index for Search #####
> 
> # DC metadata elements.qualifiers to be indexed for search
> # format: - search.index.[number] = [search field]:element.qualifier
> #         - * used as wildcard
> 
> ###      changing these will change your search results,     ###
> ###  but will NOT automatically change your search displays  ###
> 
> #search.index.1 = author:dc.contributor.*
> #search.index.2 = author:dc.creator.*
> #search.index.3 = title:dc.title.*
> #search.index.4 = keyword:dc.subject.keywords
> #search.index.5 = abstract:dc.description.abstract
> #search.index.6 = description:dc.description.*
> #search.index.7 = identifier:dc.identifier.*
> #search.index.6 = author:dc.description.statementofresponsibility
> 
> search.index.1 = author:dc.contributor.*
> search.index.2 = author:dc.creator.*
> search.index.3 = title:dc.title.*
> search.index.4 = keyword:dc.subject.*
> search.index.5 = abstract:dc.description.abstract
> search.index.6 = identifier:dc.identifier.titleControlKey
> search.index.7 = series:dc.relation.ispartofseries
> search.index.8 = abstract:dc.description.tableofcontents
> search.index.9 = mime:dc.format.mimetype
> search.index.10 = sponsor:dc.description.sponsorship
> search.index.11 = identifier:dc.identifier.*
> search.index.12 = language:dc.language.iso
> 
>      Finally, can anyone point me to some good documentation on the 
> search parameters in dspace.cfg in 1.4.2 and how to set them in order to 
> maximize your search capabilities and the integrity of search results?
> 
>  
> 
> Thanks in advance,
> 
> Sue
> 
>  
> 
> Sue Walker-Thornton
> ConITS Contract
> NASA Langley Research Center
> Integrated Library Systems Application & Database Administrator
> 130 Research Drive
> Hampton, VA  23666
> Office: (757) 224-4074
> Fax:    (757) 224-4001
> Pager:  (757) 988-2547
> Email:  susan.m.thorn...@nasa.gov <mailto:susan.m.thorn...@nasa.gov>
> 
>  
> 
> 
> _______________________________________________
> DSpace-tech mailing list
> DSpace-tech@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/dspace-tech

