Re: Search Multiple indexes In Solr

2007-11-08 Thread zx zhang
It is said that this new feature will be added in Solr 1.3, but I am not sure
about that.

I think the following may be useful for you:
https://issues.apache.org/jira/browse/SOLR-303
https://issues.apache.org/jira/browse/SOLR-255


2007/11/8, j 90 [EMAIL PROTECTED]:

 Hi, I'm new to Solr but very familiar with Lucene.

 Is there a way to have Solr search in more than one index, much like the
 MultiSearcher in Lucene?

 If so, how do I configure the location of the indexes?



Re: SOLR 1.2 - Duplicate Documents??

2007-11-08 Thread Yonik Seeley
On Nov 7, 2007 12:30 PM, realw5 [EMAIL PROTECTED] wrote:
 We did have Tomcat crash once (JVM OutOfMem) during an indexing process,
 could that be a possible source of the issue?

Yes.
Deletes are buffered and carried out in a different phase, so a crash in
between can leave documents that were meant to be deleted (or replaced)
still in the index.

-Yonik


AW: What is the best way to index xml data preserving the mark up?

2007-11-08 Thread Hausherr, Jens
Hi, 

if you just need to preserve the xml for storing you could simply wrap the xml 
markup in CDATA. Splitting your structure beforehand and using dynamic fields 
might be a viable solution...

e.g.

<add>
  <doc>
    <field name="foo1">value 1</field>
    <field name="foo2">value 2</field>

    <field name="content"><![CDATA[an xml stream with embedded source
markup]]></field>
  </doc>
</add>


 

Mit freundlichen Grüßen / Best Regards / Avec mes meilleures salutations

 
Jens Hausherr 
 
Dipl.-Wirtsch.Inf. (Univ.) 
Senior Consultant 
 
Tel: 040-27071-233
Fax: 040-27071-244
Fax: +49-(0)178-998866-097
Mobile: +49-(0)178-8866-097
 
mailto: mailto:[EMAIL PROTECTED] mailto:[EMAIL PROTECTED] 
 
Unilog Avinci - a LogicaCMG company
Am Sandtorkai 72
D-20457 Hamburg
http://www.unilog.de http://www.unilog.de/ 
 
Unilog Avinci GmbH
Zettachring 4, 70567 Stuttgart
Amtsgericht Stuttgart HRB 721369
Geschäftsführer: Torsten Straß / Eric Guyot / Rudolf Kuhn / Olaf Scholz
 


This e-mail and any attachment is for authorised use by the intended 
recipient(s) only. It may contain proprietary material, confidential 
information and/or be subject to legal privilege. It should not be copied, 
disclosed to, retained or used by, any other party. If you are not an intended 
recipient then please promptly delete this e-mail and any attachment and all 
copies and inform the sender. Thank you.


Discovering RequestHandler parameters at runtime

2007-11-08 Thread Grant Ingersoll

Hi,

Is there any way to interrogate a RequestHandler to discover what  
parameters it supports at runtime?  Kind of like a BeanInfo for  
RequestHandlers?  Has anyone else thought about doing this and what it  
might look like?  Seems like it would be useful for building dynamic  
web forms.


Thanks,
Grant


RE: What is the best way to index xml data preserving the mark up?

2007-11-08 Thread Binkley, Peter
I've used eXist for this kind of thing and had good experiences, once I
got a grip on Xquery (which is definitely worth learning). But I've only
used it for small collections (under 10k documents); I gather its
effective ceiling is much lower than Solr's. 

Possibly it will be possible to use Lucene's new payloads to do this
kind of thing (at least, storing Xpath information is one of the
proposed uses: http://lucene.grantingersoll.com/2007/03/18/payloads/ ),
as Erik Hatcher suggested in relation to
https://issues.apache.org/jira/browse/SOLR-380 .

Peter

-Original Message-
From: David Neubert [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, November 07, 2007 9:52 PM
To: solr-user@lucene.apache.org
Subject: Re: What is the best way to index xml data preserving the mark
up?

Thanks Walter -- 

I am aware of MarkLogic -- and agree -- but I have a very low budget for
licensed software in this case (near 0) --

have you used eXist or Xindice?

Dave

- Original Message 
From: Walter Underwood [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Wednesday, November 7, 2007 11:37:38 PM
Subject: Re: What is the best way to index xml data preserving the mark
up?

If you really, really need to preserve the XML structure, you'll be
doing a LOT of work to make Solr do that. It might be cheaper to start
with software that already does that. I recommend MarkLogic -- I know
the principals there, and it is some seriously fine software. Not free
or open, but very, very good.

If your problem can be expressed in a flat field model, then your
problem is mapping your document model into Solr. You might be able to
use structured field names to represent the XML context, but that is
just a guess.

With a mixed corpus of XML and arbitrary text, requiring special
handling of XML, yow, that's a lot of work.

One thought -- you can do flat fields in an XML engine (like MarkLogic)
much more easily than you can do XML in a flat field engine (like
Lucene).

wunder

On 11/7/07 8:18 PM, David Neubert [EMAIL PROTECTED] wrote:

 I am sure this is a 101 question, but I am a bit confused about indexing
 xml data
 using SOLR.
 
 I have rich xml content (books) that needs to be searched at granular
 levels
 (specifically paragraph and sentence levels very accurately, no 
 approximations).  My source text has exact <p></p> and <s></s> tags
 for this
 purpose.  I have built this app in previous versions (using other
 search
 engines) indexing the text twice, (1) where every paragraph was a
 virtual
 document and (2) where every sentence was a virtual document -- both 
 extracted from the source file (which was a single xml file for the
 entire
 book).  I have of course thought about using an XML engine (eXist or
 Xindice),
 but I prefer the stability, user base, and performance that 
 Lucene/SOLR seems to have, and also there is a large body of text
 that is
 regular documents and not well-formed XML.
 
 I am brand new to SOLR (one day) and at a basic level understand
 SOLR's nice
 simple xml scheme to add documents:
 
 <add>
   <doc>
     <field name="foo1">foo value 1</field>
     <field name="foo2">foo value 2</field>
   </doc>
   <doc>...</doc>
 </add>
 
 But my problem is that I believe I need to preserve the xml markup at
 the
 paragraph and sentence levels, so I was hoping to create a content
 field that
 could just contain the source xml for the paragraph or sentence
 respectively.
 There are reasons for this that I won't go into -- a lot of granular
 work in
 this app, accessing paragraphs and sentences.
 
 Obviously an XML mechanism that could leverage the xml structure (via
 XPath or
 XPointers) would work great.  Still I think Lucene can do this in a
 field-level way -- and I also can't imagine that users who are indexing XML
 documents
 have to go through the trouble of stripping all the markup before
 indexing?
 Hopefully I am missing something basic?
 
 It would be great to be pointed in the right direction on this matter.
 
 I think I need something along this line:
 
 <add>
   <doc>
     <field name="foo1">value 1</field>
     <field name="foo2">value 2</field>
 
     <field name="content">an xml stream with embedded source
 markup</field>
   </doc>
 </add>
 
 Maybe the overall question -- what is the best way to index XML
 content
 using SOLR -- is all this tag stripping really necessary?
 
 Thanks for any help,
 
 Dave
 
 
 
 
 
 __
 Do You Yahoo!?
 Tired of spam?  Yahoo! Mail has the best spam protection around 
 http://mail.yahoo.com






__
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around
http://mail.yahoo.com 


Re: AW: What is the best way to index xml data preserving the mark up?

2007-11-08 Thread David Neubert
Thanks -- CDATA might be useful -- and I was looking into dynamic fields as a 
solution as well -- I think a combination of the two might work.

- Original Message 
From: Hausherr, Jens [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Thursday, November 8, 2007 4:03:02 AM
Subject: AW: What is the best way to index xml data preserving the mark up?


Hi, 

if you just need to preserve the xml for storing you could simply wrap
 the xml markup in CDATA. Splitting your structure beforehand and using
 dynamic fields might be a viable solution...

e.g.

<add>
  <doc>
    <field name="foo1">value 1</field>
    <field name="foo2">value 2</field>

    <field name="content"><![CDATA[an xml stream with embedded source
 markup]]></field>
  </doc>
</add>


 


Re: Discovering RequestHandler parameters at runtime

2007-11-08 Thread Chris Hostetter

:  Is there anyway to interrogate a RequestHandler to discover what parameters
:  it supports at runtime?  Kind of like a BeanInfo for RequestHandlers?  Has

: Also, check:
: http://wiki.apache.org/solr/MakeSolrMoreSelfService

Yeah, that wiki is as far as i ever got.  note that it vastly predates a 
lot of the LukeRequestHandler type stuff and even the general attitude of 
moving more towards RequestHandlers as the general processing units of solr 
for handling all requests (even admin style requests)

Note that while it might be handy to have something like BeanInfo where 
the *class* tells you what params it supports, the important feature would 
be something where the *instance* tells you what params it supports, 
because it won't want to advertise params that it has invariants set for.  
(i touch on this in that wiki)

Ultimately i think it would be good if RequestHandlers implemented a 
method that returned a big data structure containing everything they 
wanted to advertise about themselves, and most of the admin screens and 
the form.jsp in the current codebase got replaced by a 
FormRequestHandler that would inspect the SolrCore for a list of all 
RequestHandlers that were advertising themselves and create forms for 
them.

-Hoss



Re: Tomcat JNDI Settings

2007-11-08 Thread Wayne Graham
Hi Hoss,

I just wanted to follow up to the list on this one...I could never get
the JNDI settings to work with Tomcat. I went to Jetty and everything is
working quite nicely.

Wayne

Chris Hostetter wrote:
 : Thanks for getting back to me. The folder /var/lib/tomcat5/solr/home
 : exists as does /var/lib/tomcat5/solr/home/conf/solrconfig.xml. It's
 : basically a copy of the files from examples folder at this point.
 : 
 : I put war files in /var/lib/tomcat5/webapps, so I have the
 : apache-solr-1.2.0.war file outside of the webapps folder.
 : 
 : Are there any special permissions these files need? I have them owned by
 : the tomcat user.
 
 that should be fine ... is /var/lib/tomcat5/solr/home/ writable by the 
 tomcat user so it can make the ./data and ./data/index directories?
 
 are you sure there aren't any other errors in the logs above the one you 
 mentioned already?
 
 
 
 
 -Hoss
 

-- 
/**
 * Wayne Graham
 * Earl Gregg Swem Library
 * PO Box 8794
 * Williamsburg, VA 23188
 * 757.221.3112
 * http://swem.wm.edu/blogs/waynegraham/
 */



Re: Discovering RequestHandler parameters at runtime

2007-11-08 Thread Ryan McKinley

Grant Ingersoll wrote:

Hi,

Is there anyway to interrogate a RequestHandler to discover what 
parameters it supports at runtime?  Kind of like a BeanInfo for 
RequestHandlers?  Has anyone else thought about doing this and what it 
might look like?  Seems like it would be useful for building dynamic web 
forms.




currently there is not...  I started down that route a while ago, but 
got distracted by other things.  I think it's a good idea.


Also, check:
http://wiki.apache.org/solr/MakeSolrMoreSelfService

ryan


Re: AW: What is the best way to index xml data preserving the mark up?

2007-11-08 Thread Chris Hostetter

: Thanks -- C-Data might be useful -- and I was looking into dynamic 
: fields as solution as well -- I think a combination of the two might 
: work.

I must admit i haven't been following this thread that closely, so i'm not 
sure how much of the structure of the XML you want to preserve for the 
purposes of querying, or if it's just an issue of wanting to store the raw 
XML, but on the broader topic of indexing/searching arbitrary XML, i'd 
like to throw out a few misc ideas i've had in the past that you might 
want to run with...

1) there's a Jira issue i opened a while back with a rough patch for 
applying user-specific XSLTs on the server to transform arbitrary XML 
into the Solr XML update format (i don't have the issue number handy, and 
my browser is in the throes of death at the moment).  this might solve the 
"i want to send solr XML in my own schema, and i want to be able to tell 
it how to pull out various pieces to use as field values" case.

2) I was once toying with the idea of an XPathTokenizer.  it would parse 
the fieldValues as XML, then apply arbitrary configured XPath expressions 
against the DOM and use the resulting NodeList to produce the TokenStream.
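[Editor's sketch of idea (1), assuming a hypothetical source schema with <book> and <para id="..."> elements -- the element and field names here are illustrative, not from the patch itself. A minimal server-side XSLT mapping that structure into Solr's update format might look like:

```xml
<!-- hypothetical: maps <book><para id="..."> elements into Solr <add>/<doc> format -->
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/book">
    <add>
      <xsl:for-each select="para">
        <doc>
          <!-- use the para's id attribute as the unique key -->
          <field name="id"><xsl:value-of select="@id"/></field>
          <!-- index the para's text content -->
          <field name="content"><xsl:value-of select="."/></field>
        </doc>
      </xsl:for-each>
    </add>
  </xsl:template>
</xsl:stylesheet>
```
]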


__
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around 
http://mail.yahoo.com



-Hoss



Re: How to do GeoSpatial search in SOLR/Lucene

2007-11-08 Thread Chris Hostetter
: How to do Geo Spatial search in SOLR/Lucene?

i still haven't had a chance to play with any of the good stuff people 
have been talking about, but there have been several recent threads 
talking about it...

http://www.nabble.com/forum/Search.jtp?query=geographic&local=y&forum=14479



-Hoss



Re: AW: What is the best way to index xml data preserving the mark up?

2007-11-08 Thread Tricia Williams

Hi Dave,

This sounds like what I've been trying to work out with 
https://issues.apache.org/jira/browse/SOLR-380.  The idea that I'm 
running with right now is indexing the xml and storing the data in the 
xml tags as a Payload.  Payload is a relatively new idea from  Lucene.  
A custom SolrHighlighter provides position hits (our need for this is 
highlighting on an image while searching the OCR text of the image) and 
some context to where they appear in the document using the stored Payload.


Tricia

David Neubert wrote:

Chris

I'll try to track down your Jira issue.

(2) sounds very helpful -- I am only 2 days old in SOLR/Lucene experience, but know 
what I need -- and basically it's to search by the main granules in an xml document, 
which usually turn out to be, for books: book (rarely), chapter (more often), 
paragraph (often), sentence (often).  Then there are niceties like chapter title, 
headings, etc., but I can live without that -- but it seems like if you can exploit 
the text nodes of arbitrary XML you are looking good; if not, you have a lot of 
machination in front of you.

Seems like Lucene/SOLR is geared to take record-oriented and non-xml content 
and put it into XML format for ingest -- but really can't digest XML content 
itself at all without significant setup and constraints.  I am surprised -- but 
I could really use it for my project big time.

Another related problem I am having (which I will probably repost separately) 
is boolean searches across fields with multiple values.  At this point, because 
of my workarounds for Lucene (to this point) I am indexing paragraphs as 
single documents with multiple fields, thinking I could copy the sentences to 
text.  In that way, I can search field text (for the paragraph) -- and search 
field sentence -- for sentence granularity.  The problem is that a search for 
sentence:foo AND sentence:bar is matching if foo matches in any sentence of the 
paragraph, and bar also matches in any sentence of the paragraph.  I need it to 
match only if foo and bar are found in the same sentence.  If this can't be done, 
it looks like I will have to index paragraphs as documents, and redundantly index 
sentences as unique documents.  Again, I will post this question separately 
immediately.

Thanks,

Dave
  




Boolean matches in a unique instance of a multi-value field?

2007-11-08 Thread David Neubert


Is it possible to find boolean matches (foo AND bar) in a single unique 
instance of a multi-valued field?  So if foo is found in one instance of the 
multi-valued field, and bar is found in another instance of the multi-valued 
field -- this WOULD NOT be a match; it matches only if both words are found in the 
same instance of the multi-valued field.

Thanks,

Dave
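[Editor's note: one common workaround, discussed elsewhere in this archive, is to index each sentence as its own document, so a boolean AND is naturally constrained to a single sentence. A sketch in Solr's update format -- the id scheme and field names here are hypothetical:

```xml
<!-- hypothetical: each sentence is its own document, so
     sentence:foo AND sentence:bar must match within one sentence -->
<add>
  <doc>
    <field name="id">book1.ch2.p5.s1</field>
    <field name="sentence">the quick brown fox</field>
  </doc>
  <doc>
    <field name="id">book1.ch2.p5.s2</field>
    <field name="sentence">jumped over the lazy dog</field>
  </doc>
</add>
```
]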




__
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around 
http://mail.yahoo.com 




__
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around 
http://mail.yahoo.com 

Re: Simple sorting questions

2007-11-08 Thread Chris Hostetter

: 1. There appears to be (at least) two ways to specify sorting, one
: involving an append to the q parm and the other using the sort parm.
: Are these exactly equivalent?
: 
:    http://localhost/solr/select/?q=martha;author+asc
:    http://localhost/solr/select/?q=martha&sort=author+asc

They should be, but the first form is heavily deprecated and should not be 
used

: 2. The docs say that sorting can only be applied to non-multivalued
: fields.  Does this mean that sorting won't work *at all* for
: multi-valued fields or only that the behaviour is indeterminate?

The behavior is undefined, in that it might return results in an 
indeterminate order, or it might flat out fail -- it all depends on the 
nature of the data in the field.

Note: it's not specifically that the field must be non-multivalued ... 
even if a field says multiValued="false" it still might not be a valid 
field to sort on if it uses an Analyzer that produces multiple tokens per 
field value (so *most* TextField based fields won't work, unless you use 
the KeywordTokenizer or something equivalent)
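[Editor's sketch of the distinction in a schema.xml fragment -- the type and field names here are hypothetical; the point is that KeywordTokenizer emits exactly one token per field value, which makes the copy of the field safe to sort on:

```xml
<!-- hypothetical schema.xml fragment: "author" is tokenized for search,
     "author_sort" keeps one token per value and is safe to sort on -->
<fieldType name="string_sortable" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="author" type="text" indexed="true" stored="true"/>
<field name="author_sort" type="string_sortable" indexed="true" stored="false"/>
<copyField source="author" dest="author_sort"/>
```
]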

: Based on a brief test, sorting a multi-valued field appeared to work
: by picking an arbitrary value when multiple values are present and

as i recall, that will happen when the number of distinct terms indexed 
for that field is less than the number of documents in the index ... but 
if tomorrow you add a document that contains a bunch of new terms, and 
shifts the balance so that there are more terms than documents, any search 
attempting to sort on that field will start to fail completely.

(the specifics of why that happens relate to the underlying Lucene 
FieldCache specifics ... i won't bother trying to explain it or even to 
defend it, because i'm not fond of it at all -- but i haven't thought of 
any easy ways to improve it that don't suffer performance penalties for 
the more common case of people sorting on fields that are ok to sort 
on).




-Hoss



Re: Multiple indexes

2007-11-08 Thread John Reuning
I've had good luck with MultiCore, but you have to sync trunk from svn 
and apply the most recent patch in SOLR-350.


https://issues.apache.org/jira/browse/SOLR-350

-jrr

Jae Joo wrote:

Hi,

I am looking for a way to utilize multiple indexes in a single Solr
instance.
I saw that there is a patch (215) available and would like to ask someone
who knows how to use multiple indexes.

Thanks,

Jae Joo





Re: Tomcat JNDI Settings

2007-11-08 Thread Chris Hostetter
: I just wanted to follow up to the list on this one...I could never get
: the JNDI settings to work with Tomcat. I went to Jetty and everything is

I'm not sure what to tell you.  

I've been prepping my ApacheCon demo for next week using Tomcat and JNDI 
and i haven't had any problems.  i've got a few helper scripts that 
save me typing when i set it up (they use "sh -x" to echo the shell 
commands they execute when they run), but here's everything i do just so 
you can see what i've got going on ... it might help you figure out what's 
not working about your setup.

At the end of all of this Solr is up and running in tomcat using my 
configured SolrHome...

[EMAIL PROTECTED]:/var/tmp/ac-demo$ pwd
/var/tmp/ac-demo
[EMAIL PROTECTED]:/var/tmp/ac-demo$ ls
books-solr-home   demo-links.html raw-data   
tomcat-context.xml
create-tomcat-context.sh  install-tomcat-and-solr.sh  tar-balls
[EMAIL PROTECTED]:/var/tmp/ac-demo$ find books-solr-home/
books-solr-home/
books-solr-home/conf
books-solr-home/conf/xslt
books-solr-home/conf/xslt/example.xsl
books-solr-home/conf/xslt/example_atom.xsl
books-solr-home/conf/schema_minimal.xml
books-solr-home/conf/solrconfig.xml
books-solr-home/conf/synonyms.txt
books-solr-home/conf/schema_books.xml
books-solr-home/conf/schema.xml
[EMAIL PROTECTED]:/var/tmp/ac-demo$ cat tomcat-context.xml

<!--

An example of declaring a specific tomcat context file that
points at our solr.war (anywhere we want it) and a Solr Home
directory (anywhere we want it) using JNDI.

We could have multiple context files like this, with different
names (and different Solr Home settings) to support multiple
indexes on one box.

-->
<Context
  docBase="/var/tmp/ac-demo/apache-solr-1.2.0/dist/apache-solr-1.2.0.war"
  debug="0"
  crossContext="true" >

  <Environment name="solr/home"
       value="/var/tmp/ac-demo/books-solr-home/"
       type="java.lang.String"
       override="true" />
</Context>
[EMAIL PROTECTED]:/var/tmp/ac-demo$ ./install-tomcat-and-solr.sh
+ cd /var/tmp/ac-demo/
+ tar -xzf tar-balls/apache-tomcat-6.0.14.tar.gz
+ tar -xzf tar-balls/apache-solr-1.2.0.tgz
[EMAIL PROTECTED]:/var/tmp/ac-demo$ ls
apache-solr-1.2.0 books-solr-home   demo-links.html 
raw-data   tomcat-context.xml
apache-tomcat-6.0.14  create-tomcat-context.sh  install-tomcat-and-solr.sh  
tar-balls
[EMAIL PROTECTED]:/var/tmp/ac-demo$ ./create-tomcat-context.sh
+ mkdir -p apache-tomcat-6.0.14/conf/Catalina/localhost/
+ cp tomcat-context.xml 
apache-tomcat-6.0.14/conf/Catalina/localhost/books-solr.xml
[EMAIL PROTECTED]:/var/tmp/ac-demo$ apache-tomcat-6.0.14/bin/catalina.sh
Using CATALINA_BASE:   /var/tmp/ac-demo/apache-tomcat-6.0.14
Using CATALINA_HOME:   /var/tmp/ac-demo/apache-tomcat-6.0.14
Using CATALINA_TMPDIR: /var/tmp/ac-demo/apache-tomcat-6.0.14/temp
Using JRE_HOME:   /opt/jdk1.5
Usage: catalina.sh ( commands ... )
commands:
  debug Start Catalina in a debugger
  debug -security   Debug Catalina with a security manager
  jpda startStart Catalina under JPDA debugger
  run   Start Catalina in the current window
  run -security Start in the current window with security manager
  start Start Catalina in a separate window
  start -security   Start in a separate window with security manager
  stop  Stop Catalina
  stop -force   Stop Catalina (followed by kill -KILL)
  version   What version of tomcat are you running?
[EMAIL PROTECTED]:/var/tmp/ac-demo$ apache-tomcat-6.0.14/bin/catalina.sh start
Using CATALINA_BASE:   /var/tmp/ac-demo/apache-tomcat-6.0.14
Using CATALINA_HOME:   /var/tmp/ac-demo/apache-tomcat-6.0.14
Using CATALINA_TMPDIR: /var/tmp/ac-demo/apache-tomcat-6.0.14/temp
Using JRE_HOME:   /opt/jdk1.5
[EMAIL PROTECTED]:/var/tmp/ac-demo$



Re: AW: What is the best way to index xml data preserving the mark up?

2007-11-08 Thread David Neubert
Chris

I'll try to track down your Jira issue.

(2) sounds very helpful -- I am only 2 days old in SOLR/Lucene experience, but 
know what I need -- and basically it's to search by the main granules in an xml 
document, which usually turn out to be, for books: book (rarely), chapter (more 
often), paragraph (often), sentence (often).  Then there are niceties like 
chapter title, headings, etc., but I can live without that -- but it seems like 
if you can exploit the text nodes of arbitrary XML you are looking good; if 
not, you have a lot of machination in front of you.

Seems like Lucene/SOLR is geared to take record-oriented and non-xml content 
and put it into XML format for ingest -- but really can't digest XML content 
itself at all without significant setup and constraints.  I am surprised -- but 
I could really use it for my project big time.

Another related problem I am having (which I will probably repost separately) 
is boolean searches across fields with multiple values.  At this point, because 
of my workarounds for Lucene (to this point) I am indexing paragraphs as 
single documents with multiple fields, thinking I could copy the sentences to 
text.  In that way, I can search field text (for the paragraph) -- and search 
field sentence -- for sentence granularity.  The problem is that a search for 
sentence:foo AND sentence:bar is matching if foo matches in any sentence of the 
paragraph, and bar also matches in any sentence of the paragraph.  I need it to 
match only if foo and bar are found in the same sentence.  If this can't be done, 
it looks like I will have to index paragraphs as documents, and redundantly index 
sentences as unique documents.  Again, I will post this question separately 
immediately.

Thanks,

Dave



Re: where to hook in to SOLR to read field-label from functionquery

2007-11-08 Thread Chris Hostetter

: Say I have a custom functionquery MinFloatFunction which takes as its
: arguments an array of valuesources. 
: 
: MinFloatFunction(ValueSource[] sources)
: 
: In my case all these valuesources are the values of a collection of fields.

a ValueSource isn't required to be field specific (it may already be the 
mathematical combination of multiple other fields) so there is no generic 
way to get the field name from a ValueSource ... but you could define 
your MinFloatFunction to only accept FieldCacheSource[] as input ... hmmm, 
except that FieldCacheSource doesn't expose the field name.  so instead you 
write...

  public class MyFieldCacheSource extends FieldCacheSource {
public MyFieldCacheSource(String field) {
  super(field);
}
public String getField() {
  return field;
}
  }
  public class MinFloatFunction ... {
public MinFloatFunction(MyFieldCacheSource[] values);
  }


: For this I designed a schema in which each 'row' in the index represents a
: product (indepdent of variants) (which takes care of the 1 variant max) and
: every variant is represented as 2 fields in this row:
: 
: variant_p_* -- represents price (stored / indexed)
: variant_source_*  -- represents the other fields dependent on the
: variant (stored / multivalued)

Note: if you have a lot of variants you may wind up with the same problem 
as described here...

http://www.nabble.com/sorting-on-dynamic-fields---good%2C-bad%2C-neither--tf4694098.html

...because of the underlying FieldCache usage in FieldCacheValueSource


-Hoss



Re: What is the best way to index xml data preserving the mark up?

2007-11-08 Thread David Neubert
Thanks, I think storing the XPath is where I will ultimately wind up -- I will 
look into the links recommended below.

It's an interesting debate where the break-even point is between Lucene plus 
stored XPath info -- utilizing that for lookup and position within DOM 
structures -- versus a full-fledged XML engine.  Most corporations are in 
mixed mode -- I am surprised that Lucene (or some other vendor) doesn't really 
focus on handling both easily.  Maybe I just need to clue in on the Lucene way 
of handling XML (which so far, as you suggest, seems to be a combo using 
dynamic fields and storing XPath info).

Dave


- Original Message 
From: Binkley, Peter [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Thursday, November 8, 2007 11:23:46 AM
Subject: RE: What is the best way to index xml data preserving the mark up?

I've used eXist for this kind of thing and had good experiences, once I
got a grip on Xquery (which is definitely worth learning). But I've
 only
used it for small collections (under 10k documents); I gather its
effective ceiling is much lower than Solr's. 

Possibly it will be possible to use Lucene's new payloads to do this
kind of thing (at least, storing Xpath information is one of the
proposed uses: http://lucene.grantingersoll.com/2007/03/18/payloads/ ),
as Erik Hatcher suggested in relation to
https://issues.apache.org/jira/browse/SOLR-380 .

Peter


2Gb process on 32 bits

2007-11-08 Thread Isart Montane

Hi all,

i'm experiencing some trouble when i'm trying to launch Solr with more 
than 1.6GB of heap. My server is an FC5 box with 8GB RAM, but when I start Solr like this


java -Xmx2000m -jar start.jar

i get the following errors:

Error occurred during initialization of VM
Could not reserve enough space for object heap
Could not create the Java virtual machine.

I've tried to start a virtual machine like this

java -Xmx2000m -version

but i get the same errors.

I've read there's a kernel limitation of 2GB per process on the 32-bit 
architecture, and i just wanna know if anybody knows an alternative to 
getting a new 64-bit server.


Thanks
Isart


Re: Score of exact matches

2007-11-08 Thread Papalagi Pakeha
On 11/6/07, Walter Underwood [EMAIL PROTECTED] wrote:
 This is fairly straightforward and works well with the DisMax
 handler. Index the text into three different fields with three
 different sets of analyzers. Use something like this in the
 request handler:
 [...]
 <str name="qf">
   exact^16 noaccent^4 stemmed
 </str>
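[Editor's note: expanded into a fuller solrconfig.xml sketch -- the field names exact/noaccent/stemmed come from the thread; the handler name and surrounding structure are a hedged assumption:

```xml
<!-- hypothetical solrconfig.xml fragment around the qf shown above -->
<requestHandler name="dismax" class="solr.DisMaxRequestHandler">
  <lst name="defaults">
    <str name="qf">
      exact^16 noaccent^4 stemmed
    </str>
  </lst>
</requestHandler>
```
]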

Thanks, that's exactly what I needed. Being new to Solr I didn't know
exactly how the filters and analyzers work together. With your hint I
learned it all and now it works beautifully :-)

PaPa