(1) Operation and configuration of LCF to look exactly like Solr, in that the whole interaction with LCF is done via configuration file.
(2) Use of the crawler UI to be completely optional, and in fact not the primary way people would be expected to use LCF.
(3) LCF to run in one process, exactly like Solr.
(4) LCF to not require an external database, exactly like Solr.
(5) LCF to be completely prebuilt, exactly like Solr.

Not quite. Let me rewrite those goals:

(1) Operation and *initial* configuration of LCF to look *similar* to Solr, in that LCF *can be initially configured* via a configuration file.
(2) Use of the crawler UI to be completely optional *for initial evaluation: the configuration file would define default output and repository connections for several sample jobs (e.g., crawl an examples directory, similar to Solr's, and a portion of the LCF web site). People would then add additional connections as desired, as well as additional connectors via plug-ins in a Solr-like configuration file.*
(3) LCF *can be* run *for initial evaluation via a single console command*, exactly like Solr.
(4) LCF *would not* require an external database *setup for an initial evaluation*, exactly like Solr.
(5) LCF *with a default set of connectors* to be completely prebuilt, exactly like Solr.
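
To make this concrete, here is a sketch of what such an initial-evaluation config file might look like. Every element name and class name below is invented for discussion purposes only; it is not a proposed schema:

    <!-- Hypothetical initial-evaluation config; all names illustrative only -->
    <lcf-config>
      <!-- Default output connection pointing at a local Solr instance -->
      <outputconnection name="Local Solr"
                        class="org.apache.lcf.agents.output.solr.SolrConnector">
        <param name="serverlocation" value="http://localhost:8983/solr"/>
      </outputconnection>
      <!-- Sample repository connection: a local examples directory -->
      <repositoryconnection name="Example Docs"
                            class="org.apache.lcf.crawler.connectors.filesystem.FileConnector"/>
      <!-- Sample job tying the two together -->
      <job name="Crawl examples" repository="Example Docs" output="Local Solr">
        <startpoint path="example/docs"/>
      </job>
    </lcf-config>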

The main thrust of my idea was: 1) simplified initial evaluation, a la Solr, and 2) a deployed app could *initialize* the LCF configuration programmatically, a la Solr. There is no need for exactness, since Solr and LCF are different beasts, albeit in the same jungle.

It is not my intention to completely specify all design details at this time, just to convey the general idea in language as loose as possible. I am sure there are plenty of technical considerations that will need to be addressed over time. Others can refine the ideas as needed.

I think the idea of plug-in connectors is rather important for the longer run, but it is not essential for the immediate goal of simplifying initial evaluation.

-- Jack Krupansky

--------------------------------------------------
From: <karl.wri...@nokia.com>
Sent: Monday, May 31, 2010 6:54 PM
To: <connectors-dev@incubator.apache.org>
Subject: RE: Proposal for simple LCF deployment model

So, Jack, let me see if I understand you.  You want to see:

(1) Operation and configuration of LCF to look exactly like Solr, in that the whole interaction with LCF is done via configuration file.
(2) Use of the crawler UI to be completely optional, and in fact not the primary way people would be expected to use LCF.
(3) LCF to run in one process, exactly like Solr.
(4) LCF to not require an external database, exactly like Solr.
(5) LCF to be completely prebuilt, exactly like Solr.

It is clear, then, that the "usage scenarios" I asked you for before primarily involve integrating LCF into some other platform or application. The whole gist of your requirement centers around that. And while I generally agree that this is a good goal, I think you will find that there are some issues that you should consider, especially before recommending particular solutions:

(a) Configuration of connections, authorities, and jobs in LCF is completely dependent on the connector, authority, or output connector implementation. While this information is stored by the framework as XML, it's the underlying connector code that knows what to make of it. The connector's crawler UI plugins know how to edit it. But when you directly edit it via an XML configuration file, you throw away half of that abstraction. That means that you become responsible at the platform level for knowing fairly intimate details of how every connector's configuration information is formed. Indeed, you may need to interact directly with UI support methods in the connector in order to be able to properly set up appropriate configuration information - most connectors do this. You may not have noticed this, possibly because all you have ever tried so far with LCF are the public connectors. Please understand that these are unusual in that there is no configuration-time interaction with these connectors.
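
To illustrate the point (the interface and method names here are simplified stand-ins, not the actual LCF APIs): the framework only stores opaque name/value configuration data, and it is the connector that gives that data meaning, often by talking to the repository at configuration time:

    // Simplified stand-ins for the real LCF interfaces; names illustrative only.
    interface ConfigParams {
      String getParameter(String name);            // framework sees opaque data...
      void setParameter(String name, String value);
    }

    interface RepositoryConnector {
      // ...but only the connector knows which parameters must exist, what their
      // legal values are, and which of them can only be chosen by interrogating
      // the live repository (content types, object classes, etc.).
      void validateConfiguration(ConfigParams params) throws Exception;
    }

    // A platform that writes the XML directly takes over this connector-specific
    // knowledge itself, for every connector it ships.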

If we do what you propose, therefore, a great deal of LCF's ability to support true plug-in connector development would go away. So I fundamentally don't believe your scenario is realistic, and would like you to do a more thorough assessment before we go ahead on this. Specifically, I would want to know how you intend to solve the connection and job configuration problem, beyond just "they write the XML themselves".

The execution model whereby the connections and jobs get defined would also have problems, because the connection and job definitions fundamentally live in the database, not in the configuration file. They are *meant* to be dynamically modified by the user while LCF is doing its thing. Such changes could not automatically be written back to a configuration file by any mechanism that seems reasonable to me. So the whole Solr-type configuration model seems inappropriate to me for LCF. The only part of your request I am in favor of at this time is possibly automatic registration of connectors and authorities during startup of a (proposed) Jetty-based combined LCF process.

(b) Getting rid of an external database dependency sounds fine, in theory. But unless an embedded database has decent operating characteristics, then you would risk turning LCF into a toy.

One problem I encountered while trying to work with Derby is that a specific critical operation, namely pulling documents from the queue and handing them to threads, depends strongly on having a functioning LIMIT clause. Otherwise, each time that query is run, the database effectively needs to return every live document in the queue, which could be enormously painful for both performance and memory consumption. Derby does not support the LIMIT construct. I haven't yet gotten Derby working - I haven't been able to get any tables installed yet. But if experiments bear out my concern, you will find that Derby may limit LCF to a few thousand overall documents before falling over and dying. We'll see. In the interim, maybe you'd want to do some research and experiments to find an embedded Java database whose implementation is more complete.
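
To illustrate the kind of query at issue (table and column names invented for the example): with LIMIT, pulling a batch of n documents touches only n rows; without it, every live document in the queue comes back on each poll:

    // Illustrative JDBC sketch; hypothetical table and column names.
    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class QueuePull {
      // With LIMIT, the database can stop after maxDocs rows. Without it, the
      // fallback of Statement.setMaxRows() still lets the engine plan for -
      // and potentially materialize - the full result set.
      public static ResultSet pullBatch(Connection db, int maxDocs) throws Exception {
        PreparedStatement ps = db.prepareStatement(
            "SELECT id, docid FROM jobqueue WHERE status = 'P' " +
            "ORDER BY docpriority LIMIT ?");
        ps.setInt(1, maxDocs);
        return ps.executeQuery();
      }
    }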

It's also worth noting that nobody else who has used LCF has had any difficulties whatsoever in setting up Postgresql to suit their needs. So I suspect that your real complaint is that involving Postgresql makes integration of LCF into some other platform more difficult - and, frankly, that's your problem and not LCF's.

(c) Prebuilding a la the Solr model is certainly a worthwhile goal, and something I'm interested in working towards. At the least, I am happy to turn the configuration file into XML a la Solr. However, please also bear in mind that this would only cover properties currently present in properties.ini - the logging.ini conf file format is determined by log4j, not LCF, and thus for now logging.ini will need to remain in the "simple name-value pairs" form that you do not like. Integrating logging configuration with LCF's configuration file is no doubt possible, but I'm not quite sure how one would do it at this point, nor do I know whether Solr has implemented a unified logging configuration solution. I'm also willing to look at a class loader implementation that would allow connector jars to be delivered using a Solr-style "lib" directory, which I think would be the prerequisite for canned Solr-style builds. Revamping the build system would also be needed, but that is going to require a lot of careful thought and planning, so I expect it will be some months before anything changes there.
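
As a sketch of that first step, the name/value pairs in properties.ini could carry over to an XML file almost mechanically; the wrapper elements below are invented for discussion, and the property names are placeholders:

    <!-- Hypothetical XML form of properties.ini; element and property names
         are placeholders, not a committed format. -->
    <configuration>
      <property name="org.apache.lcf.databaseimplementationclass"
                value="org.apache.lcf.core.database.DBInterfacePostgreSQL"/>
      <property name="org.apache.lcf.logconfigfile" value="./logging.ini"/>
    </configuration>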

Karl

________________________________________
From: ext Jack Krupansky [jack.krupan...@lucidimagination.com]
Sent: Friday, May 28, 2010 4:19 PM
To: connectors-dev@incubator.apache.org
Subject: Re: Proposal for simple LCF deployment model

I meant the lcf.agents.RegisterOutput org.apache.lcf.agents.output.* and lcf.crawler.Register org.apache.lcf.crawler.connectors.* types of operations that are currently executed as standalone commands, as well as the connections created using the UI. So, you would have config file entries for both the registration of connector "classes" and the definition of the actual connections in some new form of "config" file. Sure, the connector registration initializes the database, but it is all part of the collection of operations that somebody has to perform to go from scratch to an LCF configuration that is ready to "Start" a crawl. Better to have one (or two or three if necessary) "config" file that encompasses the entire configuration setup rather than separate manual steps.
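
For instance (the file shape here is invented for discussion; the connector class names are placeholders of the kind those commands take today), the registration steps might collapse into entries like:

    <!-- Hypothetical config entries replacing the standalone registration
         commands; names are placeholders. -->
    <registration>
      <outputconnector name="Solr"
                       class="org.apache.lcf.agents.output.solr.SolrConnector"/>
      <repositoryconnector name="File system"
                           class="org.apache.lcf.crawler.connectors.filesystem.FileConnector"/>
    </registration>
    <!-- ...followed by definitions of the actual connections that use those
         registered classes. -->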

Whether it is high enough priority for the first release is a matter for
debate.

-- Jack Krupansky

--------------------------------------------------
From: <karl.wri...@nokia.com>
Sent: Friday, May 28, 2010 11:16 AM
To: <connectors-dev@incubator.apache.org>
Subject: Re: Proposal for simple LCF deployment model

Dump and restore functionality already exists, but the format is not XML.

Providing an XML dump and restore is straightforward. Making such a file operate like a true config file is not.

This, by the way, has nothing to do with registering connectors, which is a database initialization operation.

Karl

--- original message ---
From: "ext Jack Krupansky" <jack.krupan...@lucidimagination.com>
Subject: Re: Proposal for simple LCF deployment model
Date: May 28, 2010
Time: 10:33:34  AM


(b) The alternative starting point should probably autocreate the database, and should also autoregister all connectors. This will require a list, somewhere, of the connectors and authorities that are included, and their preferred UI names for that installation. This could come from the configuration information, or from some other place. Any ideas?

I would like to see two things: 1) a way to request LCF to "dump" all configuration parameters, including parameters for all output connections, repositories, jobs, et al., to an "LCF config file", and 2) the ability to start from scratch with a fresh deployment of LCF and feed it that config file, to then create all of the output connections, repository connections, and jobs to match the desired LCF configuration state.

Now, whether that config file is simple XML a la solrconfig.xml can be a matter for debate. Whether it is a separate file from the current config file can also be a matter for debate.

But, in short, the answer to your question would be that there would be an LCF config file (not just the simple keyword/value file that LCF has for global configuration settings) to seed the initial output connections, repository connections, et al.

Maybe this config file is a little closer to the Solr schema file; I think it feels that way. OTOH, the list of registered connectors, as opposed to the user-created connections that use those connectors, seems more like the Solr request handlers that are in solrconfig.xml, so maybe the initial "configuration" would be split into two separate files as in Solr. Or maybe the Solr guys have a better proposal for how they would have managed that split in Solr if they had it to do all over again. My preference would be one file for the whole configuration.

Another advantage of such a config file is that it is easier for people to
post problem reports that show exactly how they set up LCF.

-- Jack Krupansky

--------------------------------------------------
From: <karl.wri...@nokia.com>
Sent: Friday, May 28, 2010 5:48 AM
To: <connectors-dev@incubator.apache.org>
Subject: Proposal for simple LCF deployment model

The current LCF standard deployment model requires a number of moving
parts, which are probably necessary in some cases, but simply introduce
complexity in others.  It has occurred to me that it may be possible to
provide an alternate deployment model involving Jetty, which would reduce
the number of moving parts by one (by eliminating Tomcat).  A simple LCF
deployment could then, in principle, look pretty much like Solr's.

In order for this to work, the following has to be true:

(1) Jetty's basic JSP support must be comparable to Tomcat's.
(2) The class loader that Jetty uses for webapps must provide class isolation similar to Tomcat's. If this condition is not met, we'd need to build both a Tomcat and a Jetty version of each webapp.

The overall set of changes that would be required would be the following:
(a) An alternative "start" entry point would need to be coded, which would start Jetty running the lcf-crawler-ui and lcf-authority-service webapps before bringing up the agents engine (a sketch follows below).
(b) The alternative starting point should probably autocreate the database, and should also autoregister all connectors. This will require a list, somewhere, of the connectors and authorities that are included, and their preferred UI names for that installation. This could come from the configuration information, or from some other place. Any ideas?
(c) There would need to be an additional jar produced by the build process, which would be the equivalent of the Solr start.jar, so as to make running the whole stack trivial.
(d) An "LCF API" web application, which provides access to all of the current LCF commands, would also be an obvious requirement to go forward with this model.
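
A minimal sketch of what (a) could look like, assuming Jetty 7's embedded API; the port, war locations, and the trailing steps are placeholders:

    // Sketch of an alternative "start" entry point; placeholders throughout.
    import org.eclipse.jetty.server.Server;
    import org.eclipse.jetty.server.handler.ContextHandlerCollection;
    import org.eclipse.jetty.webapp.WebAppContext;

    public class LCFStart {
      public static void main(String[] args) throws Exception {
        Server server = new Server(8345);  // port is a placeholder
        ContextHandlerCollection contexts = new ContextHandlerCollection();
        contexts.addHandler(
            new WebAppContext("war/lcf-crawler-ui.war", "/lcf-crawler-ui"));
        contexts.addHandler(
            new WebAppContext("war/lcf-authority-service.war", "/lcf-authority-service"));
        server.setHandler(contexts);
        server.start();
        // ...then autocreate the database, register connectors (item (b)),
        // and bring up the agents engine.
      }
    }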

What are the disadvantages? Well, I think that the main problem would be security. This deployment model, though simple, does not control access to LCF in any way. You'd need to introduce another moving part to do that.

Bear in mind that this change would still not allow LCF to run using only one process. There are still separate RMI-based processes needed for some connectors (Documentum and FileNet). Although these could in theory be started up using Java Activation, a main reason for a separate process in Documentum's case is that DFC randomly crashes the JVM under which it runs, and thus needs to be independently restarted if and when it dies. If anyone has experience with Java Activation and wants to contribute their time to develop infrastructure that can deal with that problem, please let me know.

Finally, there is no way around the fact that LCF requires a
well-performing database, which constitutes an independent moving part of
its own.  This proposal does nothing to change that at all.

Please note that I'm not proposing that the current model go away, but
rather that we support both.

Thoughts?
Karl
