Re: Ant: Re: Adding XML searching with Lucene

Stefano Mazzocchi Tue, 11 Dec 2001 03:42:32 -0800

Bernhard Huber wrote:
> 
> Hi,
> 
> Using the Avalon components might help to speed up searching, as I
> changed the classes to Recyclable
> and fixed a bug in the IndexReaderCache that was giving me a
> TooManyOpenedFiles exception.
> As there will be a lot of clients searching, it is important to have
> a fast search, hence:
> The IndexReader is like a JDBC connection: pooling would speed things up.
> Only when the index changes is it
> necessary to recreate the IndexReader.


Good point.
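
A minimal sketch of that reuse-until-changed idea, written against plain Java rather than the Lucene API. The class name VersionedReaderCache and the three callbacks (open, close, current index version) are assumptions made for illustration; how the "version" is obtained from a real index is left open.

```java
import java.util.function.Consumer;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical helper: keeps one shared reader and reopens it only when
// the index version reported by the caller-supplied callback changes.
final class VersionedReaderCache<R> {
    private final Supplier<R> opener;        // opens a fresh reader
    private final Consumer<R> closer;        // closes a stale reader
    private final LongSupplier indexVersion; // e.g. a last-modified stamp

    private R cached;
    private long cachedVersion;

    VersionedReaderCache(Supplier<R> opener, Consumer<R> closer,
                         LongSupplier indexVersion) {
        this.opener = opener;
        this.closer = closer;
        this.indexVersion = indexVersion;
    }

    synchronized R get() {
        long current = indexVersion.getAsLong();
        if (cached == null || current != cachedVersion) {
            if (cached != null) {
                // Closing the stale reader is what keeps file handles from
                // piling up. A real cache would also have to wait until no
                // search is still using it (reference counting).
                closer.accept(cached);
            }
            cached = opener.get();
            cachedVersion = current;
        }
        return cached;
    }
}
```

Callers share whatever get() returns and never close it themselves; only the cache closes a reader, and only once the index has moved on.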

> >Why don't you throw in your skeleton ideas here and we discuss them in
> >the open?
> >
> Okay, perhaps I have misunderstood something.
> 
> >>* I will implement some paging for the search result, if there are too
> >>many search results to display on a single page.
> >>
> >
> >Yep, this is a must do.
> >
> I have done this but still using the old package names.
> 
> I added a LuceneCocoonPager (I know the names...) class, doing the hits
> per page calculation, and wrapping the Hits class. You will find it in
> the attachment plus the modified searchindex.xsp.
> 
> If searchindex.xsp stays, I'd like to have some XSP logicsheet for doing
> the rendering of the paging stuff.
> Is there some easy way to avoid declaring the logicsheet in
> cocoon.xconf?

not that I know of.

> During development I'd like
> to declare the logicsheet inside the XSP itself.

don't think it's possible on the current system.
 
> This paging stuff should go into the
> org.apache.cocoon.generator.SearchGenerator, too.
> This way the generator is able to generate only the search results which
> will be displayed.

I agree.
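
For reference, the hits-per-page arithmetic can be sketched as follows. This is not the LuceneCocoonPager from the attachment; the class HitsPager and its method names are made up for illustration, and pages are assumed to be numbered from 0.

```java
// Hypothetical pager: maps a page number to the half-open range of hit
// indexes [firstHit, lastHitExclusive) that should be generated.
final class HitsPager {
    private final int totalHits;
    private final int hitsPerPage;

    HitsPager(int totalHits, int hitsPerPage) {
        if (totalHits < 0 || hitsPerPage <= 0) {
            throw new IllegalArgumentException("need totalHits >= 0 and hitsPerPage > 0");
        }
        this.totalHits = totalHits;
        this.hitsPerPage = hitsPerPage;
    }

    // Number of pages, rounding up so a partial last page counts.
    int pageCount() {
        return (totalHits + hitsPerPage - 1) / hitsPerPage;
    }

    // Index of the first hit on the given page.
    int firstHit(int page) {
        if (page < 0 || page >= pageCount()) {
            throw new IndexOutOfBoundsException("no such page: " + page);
        }
        return page * hitsPerPage;
    }

    // One past the index of the last hit on the given page.
    int lastHitExclusive(int page) {
        return Math.min(totalHits, firstHit(page) + hitsPerPage);
    }
}
```

A generator would then walk only the hits from firstHit(page) up to lastHitExclusive(page) and emit nothing else.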

> >>* I will study the Main class for the internal crawling..
> >>
> >
> >Great
> >
> Okay, I got an overview using the environment.commandline.* classes.
> Now I have a question about crawling & indexing:
> 
> As it is now, I have an XSP to trigger the crawling & indexing. It uses http
> URLs to access the xml-content for indexing.
> Now, to speed this up, I see the following possibilities:
> 
> First, still staying in a servlet-context environment:
> * For Servlet 2.3 something like this might work:
> RequestDispatcher rd = servletContext.getRequestDispatcher(
> "/cocoon/documents/index.html?cocoon-view=content" );
> rd.include( new_request_wrapper, new_response_wrapper );
> new_response_wrapper should hold the xml-content.
> 
> For Cocoon in Servlet 2.2 and higher:
> I want to access the Cocoon instance of the current servlet-context. I
> don't want to create another
> Cocoon instance, for the sake of performance and memory consumption.
> 
> If I have to create a new Cocoon instance, I see the following choices:
> 
> * create a Cocoon instance like org.apache.cocoon.Main does, and try to
> grab the right configs, etc. like the servlet-engine Cocoon instance. How
> could I make sure I get the right configs?
> * create a Cocoon instance simulating a servlet environment.
> Can you give some hints about implementing the easiest solution?

Cocoon is an avalon component.

My best choice would be to retrieve Cocoon as a component directly from
the ComponentManager, then call the process(Environment) method
indicating what environment we want, just like the Main class does.
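
The lookup / process / release pattern this implies can be sketched as below. The three interfaces are stand-ins declared only so the example compiles on its own; they are NOT the real Avalon ComponentManager or the Cocoon Environment and Processor types, and the role name is hypothetical.

```java
// Stand-ins for illustration only; the real Avalon and Cocoon types differ.
interface SketchEnvironment {
    String uri();
}

interface SketchProcessor {
    boolean process(SketchEnvironment environment);
}

interface SketchComponentManager {
    SketchProcessor lookup(String role);
    void release(SketchProcessor component);
}

// Hypothetical indexer helper: reuses the Cocoon instance that is already
// running instead of creating a second one.
final class InProcessFetcher {
    static final String COCOON_ROLE = "cocoon"; // hypothetical role name

    private final SketchComponentManager manager;

    InProcessFetcher(SketchComponentManager manager) {
        this.manager = manager;
    }

    // Runs one request through the shared instance and always hands the
    // component back to the manager, even if processing fails.
    boolean fetch(SketchEnvironment environment) {
        SketchProcessor cocoon = manager.lookup(COCOON_ROLE);
        try {
            return cocoon.process(environment);
        } finally {
            manager.release(cocoon);
        }
    }
}
```

The environment passed in decides where the output goes, so an indexing environment could capture the XML of the content view instead of writing to a servlet response.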

> For command-line-only crawling and indexing I see the following choices:
> * Implement something like org.apache.cocoon.Main for the crawling
> and indexing. Here too I would
> grab the same config as the servlet-engine Cocoon instance.
> * Additionally, add an Ant wrapper:
> <taskdef name="cocoon-index"
> classname="org.apache.cocoon.optional.ant.CocoonIndexTask"/>
> <cocoon-index
>   index-directory="/a/c/index"
>   create="yes"
>   analyzer="org.apache.lucene.analysis.standard.StandardAnalyzer"
>   uri="index.html"
>   contextDir="${build.context}"
>   destDir="${build.dir}/ant-test/docs"
>   workDir="${build.dir}/ant-test/work"
>   logLevel="INFO">
> </cocoon-index>

 
> * Now, should there be some Cocoon Ant datatype to make it easier
> to create a Cocoon instance? Like:
>   <cocoon-index
>     index-directory="/a/c/index"
>     create="yes"
>     analyzer="org.apache.lucene.analysis.standard.StandardAnalyzer"
>     uri="index.html">
>   <cocoon
>     contextDir="${build.context}"
>     destDir="${build.dir}/ant-test/docs"
>     workDir="${build.dir}/ant-test/work"
>     logLevel="INFO"/>
>   </cocoon-index>

Hmmm, this might couple Ant to Cocoon too strongly, but I really don't know.
What do others think about this?
 
> * Apropos Ant wrapper: I implemented an Ant wrapper for the Main
> class by extending Ant's Java task class, and it works fine, calling
> Main.main() from a forked JVM.
> This creates the Cocoon documents:
> ...
>     <taskdef name="cocoon"
>       classname="org.apache.cocoon.optional.ant.CocoonJavaTask">
>       <classpath>
>         <path refid="classpath"/>
>       </classpath>
>     </taskdef>
> 
>     <cocoon
>       contextDir="${build.context}"
>       destDir="${build.dir}/ant-test/docs"
>       workDir="${build.dir}/ant-test/work"
>       logLevel="INFO"
>       uri="index.html"
>     >
>       <classpath>
>         <path refid="classpath"/>
>       </classpath>
>     </cocoon>
> ...
> But calling it with fork=false failed with a
> ClassNotFoundException. Now I wonder how the servlet engine has solved
> this...

Sounds like a classloading containment problem. Ant is not as advanced
at classloading as Tomcat is.
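
To illustrate what containment means here: a class is only visible to a loader if that loader, or one of its parents, has it on its search path. The sketch below is an assumption-laden illustration of that rule in plain Java, not a diagnosis of the Ant task; the class ClassLoaderIsolationDemo is hypothetical.

```java
import java.net.URL;
import java.net.URLClassLoader;

// Hypothetical demo of classloader visibility.
final class ClassLoaderIsolationDemo {
    private ClassLoaderIsolationDemo() {}

    // True if the given loader (or one of its parents) can find the class.
    static boolean canLoad(ClassLoader loader, String className) {
        try {
            Class.forName(className, false, loader);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    // A loader with a null parent delegates only to the bootstrap loader,
    // so it sees JDK classes plus whatever URLs it is given, nothing else.
    static ClassLoader isolatedLoader(URL... classpath) {
        return new URLClassLoader(classpath, null);
    }
}
```

With fork=false the code runs inside Ant's own JVM, so every class it needs must be reachable from the loader Ant used for the task; a servlet engine instead builds a dedicated loader per web application from WEB-INF/classes and WEB-INF/lib.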
 
> * With command-line or Ant-wrapped crawling and indexing in place, the
> last open issue is to invoke it via some timer service. Some
> application servers like WLS offer that, and I think there is some
> cron service in Avalon. Does it make sense to add the
> Avalon cron service to a simple servlet engine?

I think so.
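
Whatever service ends up doing the scheduling, the shape of the job is the same: run crawl-and-index at a fixed interval and survive a failed run. A minimal sketch using the JDK's ScheduledExecutorService as a stand-in for the Avalon cron service; the class PeriodicIndexer is hypothetical.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical wrapper that re-runs an indexing job on a fixed delay.
final class PeriodicIndexer implements AutoCloseable {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    // Starts immediately, then waits 'period' after each run finishes.
    void start(Runnable crawlAndIndex, long period, TimeUnit unit) {
        scheduler.scheduleWithFixedDelay(() -> {
            try {
                crawlAndIndex.run();
            } catch (RuntimeException e) {
                // An uncaught exception would cancel all future runs,
                // so log it and carry on with the next cycle.
                System.err.println("indexing run failed: " + e);
            }
        }, 0, period, unit);
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }
}
```

A single scheduler thread means runs never overlap, which matters if only one writer may touch the index at a time.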

> 
> >searching for 'cocoon' would result in something like:
> >
> > <search:results>
> >  <search:hit rank="1" score="89%" uri="...">
> >   <xhtml:p>
> >    <search:highlight>Cocoon</search:highlight> now offers semantic
> >    <search:highlight>search</search:highlight>
> >   </xhtml:p>
> >  </search:hit>
> >  ...
> > </search:results>
> >
> >As you can see, this also includes part of the "context" where the
> >textual information is found. This follows the Google model and I think
> >it would be a *great* feature to have.
> >
> This is possible if you change the Lucene API a bit.
> There was a posting on the Lucene mailing list regarding highlighting. I
> don't know the current state of that. Anyway, highlighting
> needs some changes in the Lucene API; I have modified "my"
> Lucene to be able to do highlighting.

Hmmm, forking Lucene is not exactly a good way of working with them. I'd
suggest you send the patches to them and see what comes of it.

I would be against having an ad-hoc modified version of Lucene in our
CVS.
 
> Moreover, if you want to have something like highlighting, the question
> is whether the summary should be stored in the
> index, too, or should we ask for the cocoon-view again, at search time,
> to get the summary?

Right, I was thinking the same thing.

Performance-wise, the obvious answer is to store the summary along with
the index.
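
If the summary text is stored, marking the matched words in it is plain string work at search time. A minimal sketch, assuming whole-word, case-insensitive matching of the query terms; the class SummaryHighlighter is hypothetical and uses no Lucene API, and the search:highlight element name is taken from the example earlier in this thread.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper: wraps query terms found in a stored summary in
// <search:highlight> elements and XML-escapes everything else.
final class SummaryHighlighter {
    private SummaryHighlighter() {}

    static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    static String highlight(String summary, List<String> terms) {
        // Quote each term so regex metacharacters are matched literally.
        String alternation = terms.stream()
                .filter(term -> !term.isEmpty())
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        if (alternation.isEmpty()) {
            return escape(summary);
        }
        Pattern pattern = Pattern.compile("\\b(?:" + alternation + ")\\b",
                Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(summary);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (matcher.find()) {
            out.append(escape(summary.substring(last, matcher.start())));
            out.append("<search:highlight>")
               .append(escape(matcher.group()))
               .append("</search:highlight>");
            last = matcher.end();
        }
        out.append(escape(summary.substring(last)));
        return out.toString();
    }
}
```

This matches literal words only; it does not know about whatever stemming or stop-word handling the analyzer applied at index time.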
 
> I have implemented the LuceneIndexContentHandler to generate non-stored
> fields: the body and all the element and attribute fields are indexed
> only, not stored.
> Now, adding a summary might make it worthwhile to store the body
> field. But what about
> <s1 title="Introduction">? The "Introduction" is not part of the body.
> How should we summarize this?

Attributes can appear only once, so what about wrapping them in square
brackets?

 [Introduction] This text is something that blah 
 blah blah [How to blah blah] blah blah blah

but I'm wide open to suggestions here.
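
As a starting point for discussion, the bracket idea can be sketched with a plain SAX handler. This is not the LuceneIndexContentHandler; the class BracketSummaryHandler is hypothetical, and it assumes every attribute value is wrapped, which a real implementation would probably restrict to chosen attributes such as title.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: builds a one-line summary in which attribute
// values appear as [value] in front of the element's text.
final class BracketSummaryHandler extends DefaultHandler {
    private final StringBuilder summary = new StringBuilder();
    private final StringBuilder pendingText = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes attributes) {
        flushText();
        for (int i = 0; i < attributes.getLength(); i++) {
            append("[" + attributes.getValue(i).trim() + "]");
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        flushText();
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text may arrive in several chunks, so buffer until a tag boundary.
        pendingText.append(ch, start, length);
    }

    private void flushText() {
        append(pendingText.toString());
        pendingText.setLength(0);
    }

    // Collapses whitespace and joins pieces with single spaces.
    private void append(String text) {
        String normalized = text.trim().replaceAll("\\s+", " ");
        if (normalized.isEmpty()) {
            return;
        }
        if (summary.length() > 0) {
            summary.append(' ');
        }
        summary.append(normalized);
    }

    static String summarize(String xml) {
        BracketSummaryHandler handler = new BracketSummaryHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
        return handler.summary.toString();
    }
}
```

The resulting string could be stored as the summary field, so the section titles stay visible in the search result context.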

-- 
Stefano Mazzocchi      One must still have chaos in oneself to be
                          able to give birth to a dancing star.
<[EMAIL PROTECTED]>                             Friedrich Nietzsche
--------------------------------------------------------------------

