Re: Search Multiple indexes In Solr
It is said that this new feature will be added in Solr 1.3, but I am not sure about that. I think the following may be useful for you: https://issues.apache.org/jira/browse/SOLR-303 https://issues.apache.org/jira/browse/SOLR-255 2007/11/8, j 90 [EMAIL PROTECTED]: Hi, I'm new to Solr but very familiar with Lucene. Is there a way to have Solr search in more than one index, much like the MultiSearcher in Lucene? If so, how do I configure the location of the indexes?
Re: SOLR 1.2 - Duplicate Documents??
On Nov 7, 2007 12:30 PM, realw5 [EMAIL PROTECTED] wrote: We did have Tomcat crash once (JVM OutOfMem) during an indexing process, could that be a possible source of the issue? Yes. Deletes are buffered and carried out in a different phase. -Yonik
AW: What is the best way to index xml data preserving the mark up?
Hi, if you just need to preserve the xml for storing you could simply wrap the xml markup in CDATA. Splitting your structure beforehand and using dynamic fields might be a viable solution... e.g.

<add>
  <doc>
    <field name="foo1">value 1</field>
    <field name="foo2">value 2</field>
    <field name="content"><![CDATA[an xml stream with embedded source markup]]></field>
  </doc>
</add>

Mit freundlichen Grüßen / Best Regards / Avec mes meilleures salutations

Jens Hausherr, Dipl.-Wirtsch.Inf. (Univ.), Senior Consultant
Tel: 040-27071-233, Fax: 040-27071-244, Mobile: +49-(0)178-8866-097
Unilog Avinci - a LogicaCMG company, Am Sandtorkai 72, D-20457 Hamburg, http://www.unilog.de
Unilog Avinci GmbH, Zettachring 4, 70567 Stuttgart, Amtsgericht Stuttgart HRB 721369
Geschäftsführer: Torsten Straß / Eric Guyot / Rudolf Kuhn / Olaf Scholz

This e-mail and any attachment is for authorised use by the intended recipient(s) only. It may contain proprietary material, confidential information and/or be subject to legal privilege. It should not be copied, disclosed to, retained or used by, any other party. If you are not an intended recipient then please promptly delete this e-mail and any attachment and all copies and inform the sender. Thank you.
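An alternative to CDATA, sketched here with only the Python standard library (the field name and content are made up for illustration): entity-escape the embedded markup before placing it in the update message. The XML parser on the Solr side decodes the entities, so the stored value is the original markup.

```python
from xml.sax.saxutils import escape

# The raw markup we want stored verbatim in the "content" field.
inner = '<p id="p1"><s>An xml stream</s> with embedded source markup</p>'

# Escape <, >, & so the markup survives inside the update XML.
field = '<field name="content">%s</field>' % escape(inner)
```

Unlike CDATA, this works even when the embedded text itself contains the `]]>` sequence.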
Discovering RequestHandler parameters at runtime
Hi, Is there any way to interrogate a RequestHandler to discover what parameters it supports at runtime? Kind of like a BeanInfo for RequestHandlers? Has anyone else thought about doing this, and what might it look like? Seems like it would be useful for building dynamic web forms. Thanks, Grant
RE: What is the best way to index xml data preserving the mark up?
I've used eXist for this kind of thing and had good experiences, once I got a grip on XQuery (which is definitely worth learning). But I've only used it for small collections (under 10k documents); I gather its effective ceiling is much lower than Solr's. It may also be possible to use Lucene's new payloads to do this kind of thing (at least, storing XPath information is one of the proposed uses: http://lucene.grantingersoll.com/2007/03/18/payloads/ ), as Erik Hatcher suggested in relation to https://issues.apache.org/jira/browse/SOLR-380 . Peter -----Original Message----- From: David Neubert [mailto:[EMAIL PROTECTED] Sent: Wednesday, November 07, 2007 9:52 PM To: solr-user@lucene.apache.org Subject: Re: What is the best way to index xml data preserving the mark up? Thanks Walter -- I am aware of MarkLogic -- and agree -- but I have a very low budget for licensed software in this case (near 0) -- have you used eXist or Xindice? Dave ----- Original Message ----- From: Walter Underwood [EMAIL PROTECTED] To: solr-user@lucene.apache.org Sent: Wednesday, November 7, 2007 11:37:38 PM Subject: Re: What is the best way to index xml data preserving the mark up? If you really, really need to preserve the XML structure, you'll be doing a LOT of work to make Solr do that. It might be cheaper to start with software that already does that. I recommend MarkLogic -- I know the principals there, and it is some seriously fine software. Not free or open, but very, very good. If your problem can be expressed in a flat field model, then your problem is mapping your document model into Solr. You might be able to use structured field names to represent the XML context, but that is just a guess. With a mixed corpus of XML and arbitrary text, requiring special handling of XML, yow, that's a lot of work. One thought -- you can do flat fields in an XML engine (like MarkLogic) much more easily than you can do XML in a flat field engine (like Lucene). 
wunder On 11/7/07 8:18 PM, David Neubert [EMAIL PROTECTED] wrote: I am sure this is a 101 question, but I am a bit confused about indexing xml data using SOLR. I have rich xml content (books) that needs to be searched at granular levels (specifically paragraph and sentence levels, very accurately, no approximations). My source text has exact <p></p> and <s></s> tags for this purpose. I have built this app in previous versions (using other search engines) indexing the text twice: (1) where every paragraph was a virtual document and (2) where every sentence was a virtual document -- both extracted from the source file (which was a single xml file for the entire book). I have of course thought about using an XML engine like eXist or Xindice, but I prefer the stability, user base and performance that Lucene/SOLR seems to have, and there is also a large body of text that is regular documents and not well-formed XML. I am brand new to SOLR (one day) and at a basic level understand SOLR's nice simple xml scheme to add documents:

<add>
  <doc>
    <field name="foo1">foo value 1</field>
    <field name="foo2">foo value 2</field>
  </doc>
  <doc>...</doc>
</add>

But my problem is that I believe I need to preserve the xml markup at the paragraph and sentence levels, so I was hoping to create a content field that could just contain the source xml for the paragraph or sentence respectively. There are reasons for this that I won't go into -- a lot of granular work in this app, accessing pars and sens. Obviously an XML mechanism that could leverage the xml structure (via XPath or XPointers) would work great. Still, I think Lucene can do this in a field-level way -- and I also can't imagine that users who are indexing XML documents have to go through the trouble of stripping all the markup before indexing? Hopefully I'm missing something basic? It would be great to be pointed in the right direction on this matter. 
I think I need something along this line:

<add>
  <doc>
    <field name="foo1">value 1</field>
    <field name="foo2">value 2</field>
    <field name="content">an xml stream with embedded source markup</field>
  </doc>
</add>

Maybe the overall question is: what is the best way to index XML content using SOLR -- is all this tag stripping really necessary? Thanks for any help, Dave __ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com
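Dave's two-pass scheme (every paragraph and every sentence as its own virtual document) can be sketched with nothing but the standard library -- the tag names and doc-id scheme here are assumptions for illustration, not from the thread:

```python
import xml.etree.ElementTree as ET

BOOK = """<book id="b1">
  <p id="p1"><s>First sentence.</s><s>Second sentence.</s></p>
  <p id="p2"><s>Third sentence.</s></p>
</book>"""

def virtual_docs(xml_text):
    """Yield one (id, granularity, text) tuple per paragraph and per sentence."""
    root = ET.fromstring(xml_text)
    for p in root.iter("p"):
        sentences = ["".join(s.itertext()) for s in p.iter("s")]
        # one virtual document for the whole paragraph...
        yield (p.get("id"), "paragraph", " ".join(sentences))
        # ...and one per sentence, with a derived id
        for i, text in enumerate(sentences, 1):
            yield ("%s.s%d" % (p.get("id"), i), "sentence", text)

docs = list(virtual_docs(BOOK))
```

Each tuple could then be turned into one Solr add-doc, so sentence-level queries never match across sentence boundaries.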
Re: AW: What is the best way to index xml data preserving the mark up?
Thanks -- CDATA might be useful -- and I was looking into dynamic fields as a solution as well -- I think a combination of the two might work. ----- Original Message ----- From: Hausherr, Jens [EMAIL PROTECTED] To: solr-user@lucene.apache.org Sent: Thursday, November 8, 2007 4:03:02 AM Subject: AW: What is the best way to index xml data preserving the mark up? [...]
Re: Discovering RequestHandler parameters at runtime
: Is there any way to interrogate a RequestHandler to discover what parameters : it supports at runtime? Kind of like a BeanInfo for RequestHandlers? Has : Also, check: : http://wiki.apache.org/solr/MakeSolrMoreSelfService Yeah, that wiki is as far as i ever got. note that it vastly predates a lot of the LukeRequestHandler type stuff and even the general attitude of moving more towards RequestHandlers as general processing units of solr for handling all requests (even admin style requests). Note that while it might be handy to have something like BeanInfo where the *class* tells you what params it supports, the important feature would be something where the *instance* tells you what params it supports, because it won't want to advertise params that it has invariants set for. (i touch on this in that wiki) Ultimately i think it would be good if RequestHandlers implemented a method that returned a big data structure containing everything they wanted to advertise about themselves, and most of the admin screens and the form.jsp in the current codebase got replaced by a FormRequestHandler that would inspect the SolrCore for a list of all RequestHandlers that were advertising themselves and create forms for them. -Hoss
Re: Tomcat JNDI Settings
Hi Hoss, I just wanted to follow up to the list on this one...I could never get the JNDI settings to work with Tomcat. I went to Jetty and everything is working quite nicely. Wayne Chris Hostetter wrote: : Thanks for getting back to me. The folder /var/lib/tomcat5/solr/home : exists as does /var/lib/tomcat5/solr/home/conf/solrconfig.xml. It's : basically a copy of the files from examples folder at this point. : : I put war files in /var/lib/tomcat5/webapps, so I have the : apache-solr-1.2.0.war file outside of the webapps folder. : : Are there any special permissions these files need? I have them owned by : the tomcat user. that should be fine ... is /var/lib/tomcat5/solr/home/ writable by the tomcat user so it can make the ./data and ./data/index directories? are you sure there aren't any other errors in the logs above the one you mentioned already? -Hoss -- /** * Wayne Graham * Earl Gregg Swem Library * PO Box 8794 * Williamsburg, VA 23188 * 757.221.3112 * http://swem.wm.edu/blogs/waynegraham/ */
Re: Discovering RequestHandler parameters at runtime
Grant Ingersoll wrote: Hi, Is there any way to interrogate a RequestHandler to discover what parameters it supports at runtime? Kind of like a BeanInfo for RequestHandlers? Has anyone else thought about doing this and what it might look like? Seems like it would be useful for building dynamic web forms. currently there is not... I started down that route a while ago, but got distracted by other things. I think it's a good idea. Also, check: http://wiki.apache.org/solr/MakeSolrMoreSelfService ryan
Re: AW: What is the best way to index xml data preserving the mark up?
: Thanks -- CDATA might be useful -- and I was looking into dynamic : fields as a solution as well -- I think a combination of the two might : work. I must admit i haven't been following this thread that closely, so i'm not sure how much of the structure of the XML you want to preserve for the purposes of querying, or if it's just an issue of wanting to store the raw XML, but on the broader topic of indexing/searching arbitrary XML, i'd like to throw out a few misc ideas i've had in the past that you might want to run with... 1) there's a Jira issue i opened a while back with a rough patch for applying user-specific XSLTs on the server to transform arbitrary XML into the Solr XML update format (i don't have the issue number handy, and my browser is in the throes of death at the moment). this might solve the "i want to send solr XML in my own schema, and i want to be able to tell it how to pull out various pieces to use as field values" case. 2) I was once toying with the idea of an XPathTokenizer. it would parse the field values as XML, then apply arbitrary configured XPath expressions against the DOM and use the resulting NodeList to produce the TokenStream. -Hoss
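Hoss's second idea -- apply configured XPath expressions to the field value and tokenize the matched nodes -- can be sketched in Python with the standard library. This is purely an illustration of the concept, not Solr's API; ElementTree also supports only a limited XPath subset:

```python
import xml.etree.ElementTree as ET

def xpath_tokens(field_value, xpath):
    """Parse the field value as XML, apply the XPath expression, and
    emit one lowercased token per word in each matched node's text."""
    root = ET.fromstring(field_value)
    tokens = []
    for node in root.findall(xpath):
        # itertext() flattens nested markup inside the matched node
        tokens.extend("".join(node.itertext()).lower().split())
    return tokens

toks = xpath_tokens(
    "<doc><title>Solr Rocks</title><body>ignore me</body></doc>",
    ".//title")
```

A real Solr Tokenizer would emit a TokenStream with positions instead of a plain list, but the selection logic would look much the same.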
Re: How to do GeoSpatial search in SOLR/Lucene
: How to do Geo Spatial search in SOLR/Lucene? i still haven't had a chance to play with any of the good stuff people have been talking about, but there have been several recent threads talking about it... http://www.nabble.com/forum/Search.jtp?query=geographic&local=y&forum=14479 -Hoss
Re: AW: What is the best way to index xml data preserving the mark up?
Hi Dave, This sounds like what I've been trying to work out with https://issues.apache.org/jira/browse/SOLR-380. The idea that I'm running with right now is indexing the xml and storing the data from the xml tags as a Payload. Payloads are a relatively new idea from Lucene. A custom SolrHighlighter provides position hits (our need for this is highlighting on an image while searching the OCR text of the image) and some context for where they appear in the document using the stored Payload. Tricia David Neubert wrote: [...]
Boolean matches in a unique instance of a multi-value field?
Is it possible to find boolean matches (foo AND bar) in a single unique instance of a multi-value field? So if foo is found in one instance of the multi-value field, and bar is found in a different instance of the multi-value field -- this WOULD NOT be a match; it is a match only if both words are found in the same instance of the multi-value field. Thanks, Dave
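One common way to get this behaviour (a sketch -- the field name and gap value are assumptions, not from this thread): give the multi-valued field a large positionIncrementGap in schema.xml, then use a proximity query whose slop is smaller than the gap, so a match can never span two values:

```xml
<!-- schema.xml: each new value of "sentence" starts 100 positions
     after the end of the previous one, so proximity queries with
     slop < 100 cannot match across value boundaries -->
<field name="sentence" type="text" indexed="true" stored="true"
       multiValued="true" positionIncrementGap="100"/>
```

With that gap in place, a query like sentence:"foo bar"~99 matches only when foo and bar occur in the same value. Note this does not help a plain boolean sentence:foo AND sentence:bar, which still matches across values -- that case needs the sentence-per-document design discussed in the thread above.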
Re: Simple sorting questions
: 1. There appears to be (at least) two ways to specify sorting, one : involving an append to the q parm and the other using the sort parm. : Are these exactly equivalent? : :http://localhost/solr/select/?q=martha;author+asc :http://localhost/solr/select/?q=martha&sort=author+asc They should be, but the first form is heavily deprecated and should not be used : 2. The docs say that sorting can only be applied to non-multivalued : fields. Does this mean that sorting won't work *at all* for : multi-valued fields or only that the behaviour is indeterminate? The behavior is undefined, in that it might return results in an indeterminate order, or it might flat out fail -- it all depends on the nature of the data in the field. Note: it's not specifically that the field must be non-multivalued ... even if a field says multiValued=false it still might not be a valid field to sort on if it uses an Analyzer that produces multiple tokens per field value (so *most* TextField based fields won't work, unless you use the KeywordTokenizer or something equivalent) : Based on a brief test, sorting a multi-valued field appeared to work : by picking an arbitrary value when multiple values are present and as i recall, that will happen when the number of distinct terms indexed for that field is less than the number of documents in the index ... but if tomorrow you add a document that contains a bunch of new terms, and shifts the balance so that there are more terms than documents, any search attempting to sort on that field will start to fail completely. (the specifics of why that happens relate to the underlying Lucene FieldCache specifics ... i won't bother trying to explain it or even to defend it, because i'm not fond of it at all -- but i haven't thought of any easy ways to improve it that don't suffer performance penalties for the more common case of people sorting on fields that are ok to sort on). -Hoss
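The usual workaround (a sketch; the field names here are assumptions) is to copy the analyzed field into a single-valued, untokenized field and sort on that instead:

```xml
<!-- schema.xml: "author" stays tokenized for searching; "author_sort"
     is a plain string field (one token per value) that is safe to sort on -->
<field name="author" type="text" indexed="true" stored="true"/>
<field name="author_sort" type="string" indexed="true" stored="false"/>
<copyField source="author" dest="author_sort"/>
```

Queries then search against author but sort with ...&sort=author_sort+asc.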
Re: Multiple indexes
I've had good luck with MultiCore, but you have to sync trunk from svn and apply the most recent patch in SOLR-350. https://issues.apache.org/jira/browse/SOLR-350 -jrr Jae Joo wrote: Hi, I am looking for a way to utilize multiple indexes in a single solr instance. I saw that there is the patch 215 available and would like to ask someone who knows how to use multiple indexes. Thanks, Jae Joo
Re: Tomcat JNDI Settings
: I just wanted to follow up to the list on this one...I could never get : the JNDI settings to work with Tomcat. I went to Jetty and everything is I'm not sure what to tell you. I've been prepping my ApacheCon demo for next week using Tomcat and JNDI and i haven't had any problems. i've got a few helper scripts that save me typing when i set it up (they use sh -x to echo the shell commands they execute when they run), but here's everything i do just so you can see what i've got going on ... it might help you figure out what's not working about your setup. At the end of all of this Solr is up and running in tomcat using my configured SolrHome...

[EMAIL PROTECTED]:/var/tmp/ac-demo$ pwd
/var/tmp/ac-demo
[EMAIL PROTECTED]:/var/tmp/ac-demo$ ls
books-solr-home  demo-links.html  raw-data  tomcat-context.xml
create-tomcat-context.sh  install-tomcat-and-solr.sh  tar-balls
[EMAIL PROTECTED]:/var/tmp/ac-demo$ find books-solr-home/
books-solr-home/
books-solr-home/conf
books-solr-home/conf/xslt
books-solr-home/conf/xslt/example.xsl
books-solr-home/conf/xslt/example_atom.xsl
books-solr-home/conf/schema_minimal.xml
books-solr-home/conf/solrconfig.xml
books-solr-home/conf/synonyms.txt
books-solr-home/conf/schema_books.xml
books-solr-home/conf/schema.xml
[EMAIL PROTECTED]:/var/tmp/ac-demo$ cat tomcat-context.xml
<!--
  An example of declaring a specific tomcat context file that points at
  our solr.war (anywhere we want it) and a Solr Home directory (anywhere
  we want it) using JNDI. We could have multiple context files like this,
  with different names (and different Solr Home settings) to support
  multiple indexes on one box.
-->
<Context docBase="/var/tmp/ac-demo/apache-solr-1.2.0/dist/apache-solr-1.2.0.war"
         debug="0" crossContext="true">
  <Environment name="solr/home" value="/var/tmp/ac-demo/books-solr-home/"
               type="java.lang.String" override="true"/>
</Context>
[EMAIL PROTECTED]:/var/tmp/ac-demo$ ./install-tomcat-and-solr.sh
+ cd /var/tmp/ac-demo/
+ tar -xzf tar-balls/apache-tomcat-6.0.14.tar.gz
+ tar -xzf tar-balls/apache-solr-1.2.0.tgz
[EMAIL PROTECTED]:/var/tmp/ac-demo$ ls
apache-solr-1.2.0  books-solr-home  demo-links.html  raw-data
tomcat-context.xml  apache-tomcat-6.0.14  create-tomcat-context.sh
install-tomcat-and-solr.sh  tar-balls
[EMAIL PROTECTED]:/var/tmp/ac-demo$ ./create-tomcat-context.sh
+ mkdir -p apache-tomcat-6.0.14/conf/Catalina/localhost/
+ cp tomcat-context.xml apache-tomcat-6.0.14/conf/Catalina/localhost/books-solr.xml
[EMAIL PROTECTED]:/var/tmp/ac-demo$ apache-tomcat-6.0.14/bin/catalina.sh
Using CATALINA_BASE:   /var/tmp/ac-demo/apache-tomcat-6.0.14
Using CATALINA_HOME:   /var/tmp/ac-demo/apache-tomcat-6.0.14
Using CATALINA_TMPDIR: /var/tmp/ac-demo/apache-tomcat-6.0.14/temp
Using JRE_HOME:        /opt/jdk1.5
Usage: catalina.sh ( commands ... )
commands:
  debug             Start Catalina in a debugger
  debug -security   Debug Catalina with a security manager
  jpda start        Start Catalina under JPDA debugger
  run               Start Catalina in the current window
  run -security     Start in the current window with security manager
  start             Start Catalina in a separate window
  start -security   Start in a separate window with security manager
  stop              Stop Catalina
  stop -force       Stop Catalina (followed by kill -KILL)
  version           What version of tomcat are you running?
[EMAIL PROTECTED]:/var/tmp/ac-demo$ apache-tomcat-6.0.14/bin/catalina.sh start
Using CATALINA_BASE:   /var/tmp/ac-demo/apache-tomcat-6.0.14
Using CATALINA_HOME:   /var/tmp/ac-demo/apache-tomcat-6.0.14
Using CATALINA_TMPDIR: /var/tmp/ac-demo/apache-tomcat-6.0.14/temp
Using JRE_HOME:        /opt/jdk1.5
[EMAIL PROTECTED]:/var/tmp/ac-demo$
Re: AW: What is the best way to index xml data preserving the mark up?
Chris I'll try to track down your Jira issue. (2) sounds very helpful -- I am only 2 days old in SOLR/Lucene experience, but know what I need -- and basically it's to search by the main granules in an xml document, which for books usually turn out to be: book (rarely), chapter (more often), paragraph (often), sentence (often). Then there are niceties like chapter title, headings, etc., but I can live without that -- it seems like if you can exploit the text nodes of arbitrary XML you are looking good; if not, you've got a lot of machination in front of you. Seems like Lucene/SOLR is geared to take record and non-xml-oriented content and put it into XML format for ingest -- but really can't digest XML content itself at all without significant setup and constraints. I am surprised -- but I could really use it for my project big time. Another related problem I am having (which I will probably repost separately) is boolean searches across fields with multiple values. At this point, because of my workarounds for Lucene (to this point), I am indexing paragraphs as single documents with multiple fields, thinking I could copy the sentences to text. In that way, I can search field text (for the paragraph) -- and search field sentence -- for sentence granularity. The problem is that a search for sentence:foo AND sentence:bar is matching if foo matches in any sentence of the paragraph, and bar also matches in any sentence of the paragraph. I need it to match only if foo and bar are found in the same sentence. If this can't be done, it looks like I will have to index paragraphs as documents, and redundantly index sentences as unique documents. Again, I will post this question separately immediately. Thanks, Dave ----- Original Message ----- From: Chris Hostetter [EMAIL PROTECTED] To: solr-user@lucene.apache.org Sent: Thursday, November 8, 2007 1:19:40 PM Subject: Re: AW: What is the best way to index xml data preserving the mark up? [...]
Re: where to hook in to SOLR to read field-label from functionquery
: Say I have a custom functionquery MinFloatFunction which takes as its : arguments an array of valuesources. : : MinFloatFunction(ValueSource[] sources) : : In my case all these valuesources are the values of a collection of fields. a ValueSource isn't required to be field specific (it may already be the mathematical combination of multiple other fields) so there is no generic way to get the field name from a ValueSource ... but you could define your MinFloatFunction to only accept FieldCacheSource[] as input ... hmmm, except that FieldCacheSource doesn't expose the field name. so instead you write...

public class MyFieldCacheSource extends FieldCacheSource {
  public MyFieldCacheSource(String field) {
    super(field);
  }
  public String getField() {
    return field;
  }
}

public class MinFloatFunction ... {
  public MinFloatFunction(MyFieldCacheSource[] values);
}

: For this I designed a schema in which each 'row' in the index represents a : product (independent of variants) (which takes care of the 1 variant max) and : every variant is represented as 2 fields in this row: : : variant_p_* -- represents price (stored / indexed) : variant_source_* -- represents the other fields dependent on the : variant (stored / multivalued) Note: if you have a lot of variants you may wind up with the same problem as described here... http://www.nabble.com/sorting-on-dynamic-fields---good%2C-bad%2C-neither--tf4694098.html ...because of the underlying FieldCache usage in FieldCacheValueSource -Hoss
Re: What is the best way to index xml data preserving the mark up?
Thanks, I think storing the XPath is where I will ultimately wind up -- I will look into the links recommended below. It's an interesting debate where the break-even point is between Lucene storing XPath info -- utilizing that for lookup and position within DOM structures -- versus a full-fledged XML engine. Most corporations are in the mixed mode -- I am surprised that Lucene (or some other vendor) doesn't really focus on handling both easily. Maybe I just need to clue in on the Lucene way of handling XML (which so far, as you suggest, seems to be a combo of using dynamic fields and storing XPath info). Dave ----- Original Message ----- From: Binkley, Peter [EMAIL PROTECTED] To: solr-user@lucene.apache.org Sent: Thursday, November 8, 2007 11:23:46 AM Subject: RE: What is the best way to index xml data preserving the mark up? [...]
2Gb process on 32 bits
Hi all, i'm experiencing some trouble when i'm trying to launch solr with more than 1.6GB of heap. My server is an FC5 box with 8GB RAM, but when I start solr like this: java -Xmx2000m -jar start.jar i get the following errors: Error occurred during initialization of VM Could not reserve enough space for object heap Could not create the Java virtual machine. I've tried to start a virtual machine like this: java -Xmx2000m -version but i get the same errors. I've read there's a kernel limitation of about 2GB per process on 32-bit architectures, and i just wanna know if anybody knows an alternative to getting a new 64-bit server. Thanks Isart
Re: Score of exact matches
On 11/6/07, Walter Underwood [EMAIL PROTECTED] wrote: This is fairly straightforward and works well with the DisMax handler. Index the text into three different fields with three different sets of analyzers. Use something like this in the request handler: [...] <str name="qf">exact^16 noaccent^4 stemmed</str> Thanks, that's exactly what I needed. Being new to Solr I didn't know exactly how the filters and analyzers work together. With your hint I learned it all and now it works beautifully :-) PaPa
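Walter's three-field scheme might look like this in schema.xml (a sketch -- the source field, field names, and type names are assumptions; the analyzer chain behind each type still has to be defined to match):

```xml
<!-- The same source text copied into three fields with progressively
     looser analysis: verbatim, accent-stripped, and stemmed -->
<field name="exact" type="string" indexed="true" stored="false"/>
<field name="noaccent" type="text_noaccent" indexed="true" stored="false"/>
<field name="stemmed" type="text" indexed="true" stored="false"/>
<copyField source="title" dest="exact"/>
<copyField source="title" dest="noaccent"/>
<copyField source="title" dest="stemmed"/>
```

With the qf boosts quoted above, DisMax then scores an exact match 4x higher than an accent-insensitive match and 16x higher than a stemmed match.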