Re: Proposal for simple LCF deployment model

2010-05-31 Thread Jack Krupansky
(1) Operation and configuration of LCF to look exactly like Solr, in that 
the whole interaction with LCF is done via configuration file.
(2) Use of the crawler UI to be completely optional, and in fact not the 
primary way people would be expected to use LCF.

(3) LCF to run in one process, exactly like Solr.
(4) LCF to not require an external database, exactly like Solr.
(5) LCF to be completely prebuilt, exactly like Solr.


Not quite. Let me rewrite those goals:

(1) Operation and *initial* configuration of LCF to look *similar* to Solr, 
in that LCF *can be initially configured* via configuration file.
(2) Use of the crawler UI to be completely optional *for initial evaluation, 
with configuration of default output and repository connections for several 
sample jobs (e.g., crawl an examples directory, similar to Solr, and a 
portion of the LCF web site). People would then add additional connections 
as desired, as well as additional connectors via plug-ins, in a Solr-like 
configuration file.*
(3) LCF *can be* run *via a single console command, for initial 
configuration of connections and initial evaluation*, exactly like Solr.
(4) LCF *would not* require an external database *setup for an initial 
evaluation*, exactly like Solr.
(5) LCF *with a default set of connectors* to be completely prebuilt, 
exactly like Solr.


The main thrust of my idea was for: 1) simplified initial evaluation, ala 
Solr, and 2) a deployed app could *initialize* the LCF configuration 
programmatically, ala Solr. No need for exactness since Solr and LCF are 
different beasts, albeit in the same jungle.


It is not my intention to completely specify all design details at this 
time, just the general idea in as loose language as possible. I am sure 
there are plenty of technical considerations that will need to be addressed, 
over time. Others can refine the ideas as needed.


I think the idea, for the longer run, of plug-in connectors is rather 
important, but not essential for the goal here of simplifying initial 
evaluation.


-- Jack Krupansky

--
From: 
Sent: Monday, May 31, 2010 6:54 PM
To: 
Subject: RE: Proposal for simple LCF deployment model


So, Jack, let me see if I understand you.  You want to see:

(1) Operation and configuration of LCF to look exactly like Solr, in that 
the whole interaction with LCF is done via configuration file.
(2) Use of the crawler UI to be completely optional, and in fact not the 
primary way people would be expected to use LCF.

(3) LCF to run in one process, exactly like Solr.
(4) LCF to not require an external database, exactly like Solr.
(5) LCF to be completely prebuilt, exactly like Solr.

It is clear, then, that the "usage scenarios" I asked you for before 
primarily involve integrating LCF into some other platform or application. 
The whole gist of your requirement centers around that.  And while I 
generally agree that this is a good goal, I think you will find that there 
are some issues that you should consider, especially before recommending 
particular solutions:


(a) Configuration of connections, authorities, and jobs in LCF is 
completely dependent on the connector, authority, or output connector 
implementation.  While this information is stored by the framework as XML, 
it's the underlying connector code that knows what to make of it.  The 
connector's crawler UI plugins know how to edit it.  But when you directly 
edit it via an XML configuration file, you throw away half of that 
abstraction.  That means that you become responsible at the platform level 
for knowing fairly intimate details of how every connector's configuration 
information is formed.  Indeed, you may need to interact directly with UI 
support methods in the connector in order to be able to properly set up 
appropriate configuration information - most connectors do this.  You may 
not have noticed this, possibly because all you have ever tried so far 
with LCF are the public connectors.  Please understand that these are 
unusual in that there is no configuration-time interaction with these 
connectors.
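As a concrete illustration of the point about connector-owned configuration, 
the stored blob might look something like the sketch below. This is purely 
hypothetical: the framework stores connector configuration as XML, but the 
element and parameter names here are invented, and only the owning 
connector's code (and its crawler UI plugin) would know what they mean or 
how to edit them.

```xml
<!-- Hypothetical sketch only.  LCF stores connector configuration as an
     opaque XML blob; parameter names below are invented to illustrate
     that only the owning connector can interpret or edit them. -->
<configuration>
  <parameter name="Server" value="docs.example.com"/>
  <parameter name="Port" value="8080"/>
  <!-- a value like this may only make sense after a configuration-time
       handshake with the repository, which the connector's crawler UI
       plugin normally performs on the user's behalf -->
  <parameter name="ObjectStore" value="OS_1"/>
</configuration>
```

Hand-editing such a file means reproducing, by hand, whatever the 
connector's UI plugin would have negotiated with the repository.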


If we do what you propose, therefore, a great deal of LCF's ability to 
support true plug-in connector development would go away.  So I 
fundamentally don't believe your scenario is realistic, and would like you 
to do a more thorough assessment before we go ahead on this. 
Specifically, I would want to know how you intend to solve the connection 
and job configuration problem, beyond just "they write the XML 
themselves".


The execution model whereby the connections and jobs get defined would 
also have problems, because the connection and job definitions 
fundamentally live in the database, not in the configuration file.  They 
are *meant* to be dynamically modified by the user, while LCF is doing its 
thing.  Such changes could not automatically be written back to a 
configuration file by any mechanism that se

RE: Proposal for simple LCF deployment model

2010-05-31 Thread karl.wright
 you do not like.  Integrating logging configuration with LCF's 
configuration file is no doubt possible, but I'm not quite sure how one would 
do it at this point, nor do I know whether Solr has implemented a unified 
logging configuration solution.  I'm also willing to look at a class loader 
implementation that would allow for connector jars to be delivered using a 
Solr-style "lib directory", which I think would be the prerequisite for canned 
Solr-style builds.  Revamping the build system would also be needed, but this 
is going to require a lot of careful thought and planning, so I expect it will 
be some months before anything is changed there.

Karl


From: ext Jack Krupansky [jack.krupan...@lucidimagination.com]
Sent: Friday, May 28, 2010 4:19 PM
To: connectors-dev@incubator.apache.org
Subject: Re: Proposal for simple LCF deployment model

I meant the lcf.agents.RegisterOutput org.apache.lcf.agents.output.* and
lcf.crawler.Register org.apache.lcf.crawler.connectors.* types of operations
that are currently executed as standalone commands, as well as the
connections created using the UI. So, you would have config file entries for
both the registration of connector "classes" and the definition of the
actual connections in some new form of "config" file. Sure, the connector
registration initializes the database, but it is all part of the collection
of operations that somebody has to perform to go from scratch to an LCF
configuration that is ready to "Start" a crawl. Better to have one (or two
or three if necessary) "config" file that encompasses the entire
configuration setup rather than separate manual steps.
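A unified "config" file along these lines might look something like the 
following sketch. The class-name patterns are the ones mentioned above 
(lcf.agents.RegisterOutput, lcf.crawler.Register); everything else, 
including the element and attribute names and the specific connector 
classes, is invented for illustration and not an actual LCF format.

```xml
<!-- Hypothetical unified LCF config sketch: one file covering both
     connector-class registration and user-defined connections.
     All names here are illustrative, not an actual LCF schema. -->
<lcf-config>
  <!-- would replace standalone lcf.agents.RegisterOutput commands -->
  <register-output name="Solr"
                   class="org.apache.lcf.agents.output.solr.SolrConnector"/>
  <!-- would replace standalone lcf.crawler.Register commands -->
  <register-connector name="Web"
                      class="org.apache.lcf.crawler.connectors.webcrawler.WebcrawlerConnector"/>
  <!-- connection definitions normally created via the crawler UI -->
  <repository-connection name="Sample Web Crawl" connector="Web">
    <parameter name="Email" value="someone@example.com"/>
  </repository-connection>
</lcf-config>
```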

Whether it is high enough priority for the first release is a matter for
debate.

-- Jack Krupansky

--
From: 
Sent: Friday, May 28, 2010 11:16 AM
To: 
Subject: Re: Proposal for simple LCF deployment model

> Dump and restore functionality already exists, but the format is not XML.
>
> Providing an XML dump and restore is straightforward.  Making such a file
> operate like a true config file is not.
>
> This, by the way, has nothing to do with registering connectors, which is
> a database initialization operation.
>
> Karl
>
> --- original message ---
> From: "ext Jack Krupansky" 
> Subject: Re: Proposal for simple LCF deployment model
> Date: May 28, 2010
> Time: 10:33:34  AM
>
>
>> (b) The alternative starting point should probably autocreate the
>> database,
>> and should also autoregister all connectors.  This will require a list,
>> somewhere,
>> of the connectors and authorities that are included, and their preferred
>> UI
>> names for that installation.  This could come from the configuration
>> information, or from some other place.  Any ideas?
>
> I would like to see two things: 1) A way to request LCF to "dump" all
> configuration parameters, including parameters for all output connections,
> repositories,  jobs, et al to an "LCF config file", and 2) The ability to
> start from scratch with a fresh deployment of LCF and feed it that config
> file to then create all of the output connections, repository connections,
> and jobs to match the LCF configuration state desired.
>
> Now, whether that config file is simple XML ala solrconfig.xml can be a
> matter for debate. Whether it is a separate file from the current config
> file can also be a matter for debate.
>
> But, in short, the answer to your question would be that there would be an
> LCF config file (not just the simple keyword/value file that LCF has for
> global configuration settings) to seed the initial output connections,
> repository connections, et al.
>
> Maybe this config file is a little closer to the Solr schema file. I think
> it feels that way. OTOH, the list of registered connectors, as opposed to
> the user-created connections that use those connectors, seems more like Solr
> request handlers that are in solrconfig.xml, so maybe the initial
> "configuration" would be split into two separate files as in Solr. Or,
> maybe, the Solr guys have a better proposal for how they would have managed
> that split in Solr if they had it to do all over again. My preference would
> be one file for the whole configuration.
>
> Another advantage of such a config file is that it is easier for people to
> post problem reports that show exactly how they set up LCF.
>
> -- Jack Krupansky
>
> --
> From: 
> Sent: Friday, May 28, 2010 5:48 AM
> To: 
> Subject: Proposal for simple LCF deployment model
>
>> The current LCF standard deployment model requires a number of moving
>

Re: Proposal for simple LCF deployment model

2010-05-28 Thread Jack Krupansky
Okay, so LCF has a "config" file it loads on startup of the agents process 
to set a small number of keyword/value parameters. The config file I 
proposed would be used to "initialize" the LCF database. That would 
initialize/re-initialize the "configuration" (connector classes, 
connections, authorities, jobs, et al) from that "saved configuration" file.


So, the use case would be to set up an initial configuration of LCF, save it 
to a file (presumably XML for readability and manual edit), and then the 
deployment of LCF can be driven from that saved configuration file which 
would initialize the connector classes and connections, et al in the 
database.


This "initialize database from saved configuration file" operation could be 
used for an evaluation deployment of LCF itself or for deployment of an 
application that includes LCF.


The other use case which I previously mentioned is that users can upload 
such a saved configuration file when they are requesting assistance. 
Similarly, support staff for applications could also request that saved 
configuration file from their customers to assist in tracking down issues.


So, all of the LCF "objects" would indeed live in the database, but the 
saved configuration file would be a usable form for offsite examination and 
recreation of the configuration that lives in the database.


-- Jack Krupansky

--
From: 
Sent: Friday, May 28, 2010 2:20 PM
To: 
Subject: RE: Proposal for simple LCF deployment model

I already posted a response to this, but since it didn't seem to appear 
I'm going to try again.


LCF already has dump and restore commands, but they don't currently write 
XML, they write binary data.  Providing a way to write and read XML would 
be relatively straightforward.  But this is *not* the same thing as 
defining everything in a global configuration file.  LCF's connection, 
authority, and job definitions belong in the database.


Another proposal that would be much more Solr-like would be to allow you 
to define such things through a servlet API.  This is the approach I'd 
thought would work the best for the most people.


Note that this is a very different question to the question of registering 
connectors and authorities.  The latter operation is more akin to database 
initialization, and would be done in lieu of the current series of 
connector registration commands that you need to do to install connectors 
into LCF.  It may even be that the proper answer is still not to do this 
step at all on the quick start, although I personally think the ideal 
would be some kind of automatic registration of all connectors and 
authorities that had been built during the last build step.


Given this analysis, can you clarify your request?  I'd also like to see 
use cases because without them we're just shooting the breeze.


Karl

-Original Message-
From: ext Jack Krupansky [mailto:jack.krupan...@lucidimagination.com]
Sent: Friday, May 28, 2010 10:33 AM
To: connectors-dev@incubator.apache.org
Subject: Re: Proposal for simple LCF deployment model


(b) The alternative starting point should probably autocreate the
database,
and should also autoregister all connectors.  This will require a list,
somewhere,
of the connectors and authorities that are included, and their preferred
UI
names for that installation.  This could come from the configuration
information, or from some other place.  Any ideas?


I would like to see two things: 1) A way to request LCF to "dump" all
configuration parameters, including parameters for all output connections,
repositories,  jobs, et al to an "LCF config file", and 2) The ability to
start from scratch with a fresh deployment of LCF and feed it that config
file to then create all of the output connections, repository connections,
and jobs to match the LCF configuration state desired.

Now, whether that config file is simple XML ala solrconfig.xml can be a
matter for debate. Whether it is a separate file from the current config
file can also be a matter for debate.

But, in short, the answer to your question would be that there would be an
LCF config file (not just the simple keyword/value file that LCF has for
global configuration settings) to seed the initial output connections,
repository connections, et al.

Maybe this config file is a little closer to the Solr schema file. I think
it feels that way. OTOH, the list of registered connectors, as opposed to
the user-created connections that use those connectors, seems more like Solr 
request handlers that are in solrconfig.xml, so maybe the initial 
"configuration" would be split into two separate files as in Solr. Or, 
maybe, the Solr guys have a better proposal for how they would have managed 
that split in Solr if they had it to do all over again. My preference would 
be one file for t

RE: Proposal for simple LCF deployment model

2010-05-28 Thread karl.wright
I already posted a response to this, but since it didn't seem to appear I'm 
going to try again.

LCF already has dump and restore commands, but they don't currently write XML, 
they write binary data.  Providing a way to write and read XML would be 
relatively straightforward.  But this is *not* the same thing as defining 
everything in a global configuration file.  LCF's connection, authority, and 
job definitions belong in the database.

Another proposal that would be much more Solr-like would be to allow you to 
define such things through a servlet API.  This is the approach I'd thought 
would work the best for the most people.

Note that this is a very different question to the question of registering 
connectors and authorities.  The latter operation is more akin to database 
initialization, and would be done in lieu of the current series of connector 
registration commands that you need to do to install connectors into LCF.  It 
may even be that the proper answer is still not to do this step at all on the 
quick start, although I personally think the ideal would be some kind of 
automatic registration of all connectors and authorities that had been built 
during the last build step.

Given this analysis, can you clarify your request?  I'd also like to see use 
cases because without them we're just shooting the breeze.

Karl

-Original Message-
From: ext Jack Krupansky [mailto:jack.krupan...@lucidimagination.com] 
Sent: Friday, May 28, 2010 10:33 AM
To: connectors-dev@incubator.apache.org
Subject: Re: Proposal for simple LCF deployment model

> (b) The alternative starting point should probably autocreate the 
> database,
> and should also autoregister all connectors.  This will require a list, 
> somewhere,
> of the connectors and authorities that are included, and their preferred 
> UI
> names for that installation.  This could come from the configuration
> information, or from some other place.  Any ideas?

I would like to see two things: 1) A way to request LCF to "dump" all 
configuration parameters, including parameters for all output connections, 
repositories,  jobs, et al to an "LCF config file", and 2) The ability to 
start from scratch with a fresh deployment of LCF and feed it that config 
file to then create all of the output connections, repository connections, 
and jobs to match the LCF configuration state desired.

Now, whether that config file is simple XML ala solrconfig.xml can be a 
matter for debate. Whether it is a separate file from the current config 
file can also be a matter for debate.

But, in short, the answer to your question would be that there would be an 
LCF config file (not just the simple keyword/value file that LCF has for 
global configuration settings) to seed the initial output connections, 
repository connections, et al.

Maybe this config file is a little closer to the Solr schema file. I think 
it feels that way. OTOH, the list of registered connectors, as opposed to 
the user-created connections that use those connectors, seems more like Solr 
request handlers that are in solrconfig.xml, so maybe the initial 
"configuration" would be split into two separate files as in Solr. Or, 
maybe, the Solr guys have a better proposal for how they would have managed 
that split in Solr if they had it to do all over again. My preference would 
be one file for the whole configuration.

Another advantage of such a config file is that it is easier for people to 
post problem reports that show exactly how they set up LCF.

-- Jack Krupansky

--------------
From: 
Sent: Friday, May 28, 2010 5:48 AM
To: 
Subject: Proposal for simple LCF deployment model

> The current LCF standard deployment model requires a number of moving 
> parts, which are probably necessary in some cases, but simply introduce 
> complexity in others.  It has occurred to me that it may be possible to 
> provide an alternate deployment model involving Jetty, which would reduce 
> the number of moving parts by one (by eliminating Tomcat).  A simple LCF 
> deployment could then, in principle, look pretty much like Solr's.
>
> In order for this to work, the following has to be true:
>
> (1) jetty's basic JSP support must be comparable to Tomcat's.
> (2) the class loader that jetty uses for webapp's must provide class 
> isolation similar to Tomcat's.  If this condition is not met, we'd need to 
> build both a Tomcat and a Jetty version of each webapp.
>
> The overall set of changes that would be required would be the following:
> (a) An alternative "start" entry point would need to be coded, which would 
> start Jetty running the lcf-crawler-ui and lcf-authority-service webapps 
> before bringing up the agents engine.
> (b) The alternative starting point should probably autocreate t

Re: Proposal for simple LCF deployment model

2010-05-28 Thread Jack Krupansky
I meant the lcf.agents.RegisterOutput org.apache.lcf.agents.output.* and 
lcf.crawler.Register org.apache.lcf.crawler.connectors.* types of operations 
that are currently executed as standalone commands, as well as the 
connections created using the UI. So, you would have config file entries for 
both the registration of connector "classes" and the definition of the 
actual connections in some new form of "config" file. Sure, the connector 
registration initializes the database, but it is all part of the collection 
of operations that somebody has to perform to go from scratch to an LCF 
configuration that is ready to "Start" a crawl. Better to have one (or two 
or three if necessary) "config" file that encompasses the entire 
configuration setup rather than separate manual steps.


Whether it is high enough priority for the first release is a matter for 
debate.


-- Jack Krupansky

--
From: 
Sent: Friday, May 28, 2010 11:16 AM
To: 
Subject: Re: Proposal for simple LCF deployment model


Dump and restore functionality already exists, but the format is not XML.

Providing an XML dump and restore is straightforward.  Making such a file 
operate like a true config file is not.


This, by the way, has nothing to do with registering connectors, which is 
a database initialization operation.


Karl

--- original message ---
From: "ext Jack Krupansky" 
Subject: Re: Proposal for simple LCF deployment model
Date: May 28, 2010
Time: 10:33:34  AM



(b) The alternative starting point should probably autocreate the
database,
and should also autoregister all connectors.  This will require a list,
somewhere,
of the connectors and authorities that are included, and their preferred
UI
names for that installation.  This could come from the configuration
information, or from some other place.  Any ideas?


I would like to see two things: 1) A way to request LCF to "dump" all
configuration parameters, including parameters for all output connections,
repositories,  jobs, et al to an "LCF config file", and 2) The ability to
start from scratch with a fresh deployment of LCF and feed it that config
file to then create all of the output connections, repository connections,
and jobs to match the LCF configuration state desired.

Now, whether that config file is simple XML ala solrconfig.xml can be a
matter for debate. Whether it is a separate file from the current config
file can also be a matter for debate.

But, in short, the answer to your question would be that there would be an
LCF config file (not just the simple keyword/value file that LCF has for
global configuration settings) to seed the initial output connections,
repository connections, et al.

Maybe this config file is a little closer to the Solr schema file. I think
it feels that way. OTOH, the list of registered connectors, as opposed to
the user-created connections that use those connectors, seems more like Solr 
request handlers that are in solrconfig.xml, so maybe the initial 
"configuration" would be split into two separate files as in Solr. Or, 
maybe, the Solr guys have a better proposal for how they would have managed 
that split in Solr if they had it to do all over again. My preference would 
be one file for the whole configuration.

Another advantage of such a config file is that it is easier for people to
post problem reports that show exactly how they set up LCF.

-- Jack Krupansky

--------------
From: 
Sent: Friday, May 28, 2010 5:48 AM
To: 
Subject: Proposal for simple LCF deployment model


The current LCF standard deployment model requires a number of moving
parts, which are probably necessary in some cases, but simply introduce
complexity in others.  It has occurred to me that it may be possible to
provide an alternate deployment model involving Jetty, which would reduce
the number of moving parts by one (by eliminating Tomcat).  A simple LCF
deployment could then, in principle, look pretty much like Solr's.

In order for this to work, the following has to be true:

(1) jetty's basic JSP support must be comparable to Tomcat's.
(2) the class loader that jetty uses for webapp's must provide class
isolation similar to Tomcat's.  If this condition is not met, we'd need to 
build both a Tomcat and a Jetty version of each webapp.

The overall set of changes that would be required would be the following:
(a) An alternative "start" entry point would need to be coded, which would 
start Jetty running the lcf-crawler-ui and lcf-authority-service webapps
before bringing up the agents engine.
(b) The alternative starting point should probably autocreate the
database, and should also autoregister all connectors.  This will require
a list, somewhere, of the connectors and authorities that are included,
and their preferred UI names for that installatio
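The single-process "start" entry point proposed in points (a) and (b) above 
might look roughly like the sketch below. This is a hypothetical 
illustration using the embedded Jetty 6 API (org.mortbay.*, current at the 
time of this thread); the two webapp names come from the proposal, while 
the class name, port, and war paths are invented.

```java
// Hypothetical sketch of the proposed alternative "start" entry point:
// embedded Jetty serves the crawler UI and authority service, then the
// agents engine would be brought up in this same JVM.  Requires the
// Jetty 6 jars on the classpath; paths and port are illustrative only.
import org.mortbay.jetty.Server;
import org.mortbay.jetty.handler.ContextHandlerCollection;
import org.mortbay.jetty.webapp.WebAppContext;

public class LCFQuickStart {
    public static void main(String[] args) throws Exception {
        Server server = new Server(8345);
        ContextHandlerCollection contexts = new ContextHandlerCollection();
        contexts.addHandler(
            new WebAppContext("war/lcf-crawler-ui.war", "/lcf-crawler-ui"));
        contexts.addHandler(
            new WebAppContext("war/lcf-authority-service.war", "/lcf-authority-service"));
        server.setHandler(contexts);
        server.start();
        // (b) autocreate the database and autoregister connectors here,
        // then bring up the agents engine in this same process.
        server.join();
    }
}
```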


Re: Proposal for simple LCF deployment model

2010-05-28 Thread karl.wright
Dump and restore functionality already exists, but the format is not XML.

Providing an XML dump and restore is straightforward.  Making such a file 
operate like a true config file is not.

This, by the way, has nothing to do with registering connectors, which is a 
database initialization operation.

Karl

--- original message ---
From: "ext Jack Krupansky" 
Subject: Re: Proposal for simple LCF deployment model
Date: May 28, 2010
Time: 10:33:34  AM


> (b) The alternative starting point should probably autocreate the
> database,
> and should also autoregister all connectors.  This will require a list,
> somewhere,
> of the connectors and authorities that are included, and their preferred
> UI
> names for that installation.  This could come from the configuration
> information, or from some other place.  Any ideas?

I would like to see two things: 1) A way to request LCF to "dump" all
configuration parameters, including parameters for all output connections,
repositories,  jobs, et al to an "LCF config file", and 2) The ability to
start from scratch with a fresh deployment of LCF and feed it that config
file to then create all of the output connections, repository connections,
and jobs to match the LCF configuration state desired.

Now, whether that config file is simple XML ala solrconfig.xml can be a
matter for debate. Whether it is a separate file from the current config
file can also be a matter for debate.

But, in short, the answer to your question would be that there would be an
LCF config file (not just the simple keyword/value file that LCF has for
global configuration settings) to seed the initial output connections,
repository connections, et al.

Maybe this config file is a little closer to the Solr schema file. I think
it feels that way. OTOH, the list of registered connectors, as opposed to
the user-created connections that use those connectors, seems more like Solr
request handlers that are in solrconfig.xml, so maybe the initial
"configuration" would be split into two separate files as in Solr. Or,
maybe, the Solr guys have a better proposal for how they would have managed
that split in Solr if they had it to do all over again. My preference would
be one file for the whole configuration.

Another advantage of such a config file is that it is easier for people to
post problem reports that show exactly how they set up LCF.

-- Jack Krupansky

------
From: 
Sent: Friday, May 28, 2010 5:48 AM
To: 
Subject: Proposal for simple LCF deployment model

> The current LCF standard deployment model requires a number of moving
> parts, which are probably necessary in some cases, but simply introduce
> complexity in others.  It has occurred to me that it may be possible to
> provide an alternate deployment model involving Jetty, which would reduce
> the number of moving parts by one (by eliminating Tomcat).  A simple LCF
> deployment could then, in principle, look pretty much like Solr's.
>
> In order for this to work, the following has to be true:
>
> (1) jetty's basic JSP support must be comparable to Tomcat's.
> (2) the class loader that jetty uses for webapp's must provide class
> isolation similar to Tomcat's.  If this condition is not met, we'd need to
> build both a Tomcat and a Jetty version of each webapp.
>
> The overall set of changes that would be required would be the following:
> (a) An alternative "start" entry point would need to be coded, which would
> start Jetty running the lcf-crawler-ui and lcf-authority-service webapps
> before bringing up the agents engine.
> (b) The alternative starting point should probably autocreate the
> database, and should also autoregister all connectors.  This will require
> a list, somewhere, of the connectors and authorities that are included,
> and their preferred UI names for that installation.  This could come from
> the configuration information, or from some other place.  Any ideas?
> (c) There would need to be an additional jar produced by the build process,
> which would be the equivalent of the solr start.jar, so as to make running
> the whole stack trivial.
> (d) An "LCF API" web application, which provides access to all of the
> current LCF commands, would also be an obvious requirement to go forward
> with this model.
>
> What are the disadvantages?  Well, I think that the main problem would be
> security.  This deployment model, though simple, does not control access
> to LCF in any way.  You'd need to introduce another moving part to do
> that.
>
> Bear in mind that this change would still not allow LCF to run using only
> one process.  There are still separate RMI-based processes needed for some
> connectors (Documentum and FileNet).  Although these could in theory be

Re: Proposal for simple LCF deployment model

2010-05-28 Thread Jack Krupansky

The use cases I was considering for database issues are:

1) Desire for a very simple evaluation install process. See the Solr 
tutorial.
2) Desire for a less complex and faster application deployment/install 
process. PostgreSQL has a reputation for having "a large footprint."


Now, as machines and software evolve, it is not completely clear to me how 
"bad" PostgreSQL is these days, but having a separate deployment step to 
accommodate PostgreSQL interferes with use case #1.


That said, I am not sure that I would hold up getting the first official 
release of LCF out the door. After all, leading-edge ("bleeding-edge") users 
are used to more than a little inconvenience. Still, a Solr-simple 
evaluation install would be... "sweet".


-- Jack Krupansky

--
From: 
Sent: Friday, May 28, 2010 2:17 PM
To: 
Subject: RE: Proposal for simple LCF deployment model

I've been fighting with Derby for two days.  It's missing a significant 
amount of important functionality, and its user and database model are 
radically different from all other databases I know of.  (I'm also getting 
nonsense exceptions from it, but that's another matter.)  So regardless of 
how good the database abstraction layer is, expecting all databases to 
have sufficient functionality to get anything done is ridiculous.  If I 
get Derby working, I will let you know whether it is feasible at all to 
run LCF on it under any circumstances or not, but that *cannot* be the 
primary database people use with this project.  I'm also still waiting for 
a use-case from you as to how getting rid of the PostgreSQL database makes 
your life easier at all - and if your use case involves using Derby for 
anything serious, I'll have to say that I don't think that's realistic.


LCF has a very clean connector abstraction today.  So all we're really 
talking about is the build process here - whether it is possible to 
separate build and deployment of the framework and some connectors from 
the builds of other connectors.  Having each connector run as a separate 
process seems like overkill and would also impact performance pretty 
dramatically, as well as requiring quite a bit of additional 
configuration.  The "Solr plug-in model" is a bit better and requires only 
the addition of a custom classloader that explicitly loads any plugin 
classes and any classes that those use.  The required defines that some 
libraries need would have to be solved, but that needs doing anyway and I 
think I can have individual connectors set these as needed.
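For what it's worth, a minimal sketch of the kind of child-first classloader
such a plug-in model implies, using only the JDK (the class name is
illustrative, not actual LCF code):

```java
import java.net.URL;
import java.net.URLClassLoader;

// Sketch of a "Solr plug-in model" loader: a child-first URLClassLoader
// that tries the plugin jars before delegating to the parent loader, so
// connector classes (and the classes they use) shadow framework copies.
public class ConnectorClassLoader extends URLClassLoader {

    public ConnectorClassLoader(URL[] pluginJars, ClassLoader parent) {
        super(pluginJars, parent);
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve)
            throws ClassNotFoundException {
        synchronized (getClassLoadingLock(name)) {
            Class<?> c = findLoadedClass(name);
            if (c == null) {
                try {
                    // Look in the plugin jars first...
                    c = findClass(name);
                } catch (ClassNotFoundException e) {
                    // ...then fall back to the parent (framework) loader.
                    c = super.loadClass(name, false);
                }
            }
            if (resolve) {
                resolveClass(c);
            }
            return c;
        }
    }
}
```

A framework would point one of these at the jars in a per-connector lib
directory and load the connector's entry class through it; everything not
found in those jars still delegates up as usual.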


Karl



-Original Message-
From: ext Jack Krupansky [mailto:jack.krupan...@lucidimagination.com]
Sent: Friday, May 28, 2010 1:49 PM
To: connectors-dev@incubator.apache.org
Subject: Re: Proposal for simple LCF deployment model

But for a basic, early evaluation, "test drive", just the file system and
web repository connectors should be sufficient. And if there is a clean
database abstraction, a basic database package (e.g., derby) should be
sufficient for such a basic evaluation.

Are there technical reasons why third-party repository connectors cannot be
supported using a Solr-style "plug-in" approach? Or, worst case, as separate
processes with a clean inter-process API? Maybe not in the near-term, but as
a longer-term vision.

-- Jack Krupansky

------------------
From: 
Sent: Friday, May 28, 2010 11:10 AM
To: 
Subject: Re: Proposal for simple LCF deployment model


You forget that building LCF in its entirety requires that you supply
proprietary client components from third-party vendors.  So I think it is
unrealistic to expect canned builds that contain everything that you just
deploy.  For LCF I think the build cycle will thus be very common.

Getting rid of the database requirement is also obviously not an option.

Karl

--- original message ---
From: "ext Jack Krupansky" 
Subject: Re: Proposal for simple LCF deployment model
Date: May 28, 2010
Time: 10:42:17  AM


A simple deployment ala Solr is a good goal. Integrating Jetty with the LCF
deployment will go a long way towards that goal. The database software
deployment (PostgreSQL) is the other half of the hassle with deploying LCF.

I think there are three distinct goals here: 1) A super-easy Solr-style
deployment for initial evaluation of LCF, 2) deployment of the LCF
components for full-blown application development where app server and
database might need to be different from the initial evaluation, and 3)
deployment of LCF components for production deployment of the full
application.

Right now, evaluation of LCF requires deployment of the source code and
building artifacts - Solr evaluation does not require that step. Eliminating
the source and build step will certainly help simplify the evaluation
process.

Another possible consideration is that a

RE: Proposal for simple LCF deployment model

2010-05-28 Thread karl.wright
I've been fighting with Derby for two days.  It's missing a significant amount 
of important functionality, and its user and database model are radically 
different from all other databases I know of.  (I'm also getting nonsense 
exceptions from it, but that's another matter.)  So regardless of how good the 
database abstraction layer is, expecting all databases to have sufficient 
functionality to get anything done is ridiculous.  If I get Derby working, I 
will let you know whether it is feasible at all to run LCF on it under any 
circumstances or not, but that *cannot* be the primary database people use with 
this project.  I'm also still waiting for a use-case from you as to how getting 
rid of the PostgreSQL database makes your life easier at all - and if your use 
case involves using Derby for anything serious, I'll have to say that I don't 
think that's realistic.

LCF has a very clean connector abstraction today.  So all we're really talking 
about is the build process here - whether it is possible to separate build and 
deployment of the framework and some connectors from the builds of other 
connectors.  Having each connector run as a separate process seems like 
overkill and would also impact performance pretty dramatically, as well as 
requiring quite a bit of additional configuration.  The "Solr plug-in model" is 
a bit better and requires only the addition of a custom classloader that 
explicitly loads any plugin classes and any classes that those use.  The 
required defines that some libraries need would have to be solved, but that 
needs doing anyway and I think I can have individual connectors set these as 
needed.

Karl



-Original Message-
From: ext Jack Krupansky [mailto:jack.krupan...@lucidimagination.com] 
Sent: Friday, May 28, 2010 1:49 PM
To: connectors-dev@incubator.apache.org
Subject: Re: Proposal for simple LCF deployment model

But for a basic, early evaluation, "test drive", just the file system and 
web repository connectors should be sufficient. And if there is a clean 
database abstraction, a basic database package (e.g., derby) should be 
sufficient for such a basic evaluation.

Are there technical reasons why third-party repository connectors cannot be 
supported using a Solr-style "plug-in" approach? Or, worst case, as separate 
processes with a clean inter-process API? Maybe not in the near-term, but as 
a longer-term vision.

-- Jack Krupansky

------
From: 
Sent: Friday, May 28, 2010 11:10 AM
To: 
Subject: Re: Proposal for simple LCF deployment model

> You forget that building LCF in its entirety requires that you supply 
> proprietary client components from third-party vendors.  So I think it is 
> unrealistic to expect canned builds that contain everything that you just 
> deploy.  For LCF I think the build cycle will thus be very common.
>
> Getting rid of the database requirement is also obviously not an option.
>
> Karl
>
> --- original message ---
> From: "ext Jack Krupansky" 
> Subject: Re: Proposal for simple LCF deployment model
> Date: May 28, 2010
> Time: 10:42:17  AM
>
>
> A simple deployment ala Solr is a good goal. Integrating Jetty with the 
> LCF
> deployment will go a long way towards that goal. The database software
> deployment (PostgreSQL) is the other half of the hassle with deploying 
> LCF.
>
> I think there are three distinct goals here: 1) A super-easy Solr-style
> deployment for initial evaluation of LCF, 2) deployment of the LCF
> components for full-blown application development where app server and
> database might need to be different from the initial evaluation, and 3)
> deployment of LCF components for production deployment of the full
> application.
>
> Right now, evaluation of LCF requires deployment of the source code and
> building artifacts - Solr evaluation does not require that step. Eliminating
> the source and build step will certainly help simplify the evaluation
> process.
>
> Another possible consideration is that although some of us are especially
> interested in integration with Solr and doing so easily and robustly, Solr
> is just one of the output connections and LCF could be deployed for
> applications that do not involve Solr at all. So, maybe there should be an
> extra deployment wiki page for Solr guys that focuses on use of LCF with
> Solr and related issues. Whether that should be the default presentation in
> the doc is a matter for debate. Right now, I see no harm with a Solr bias.
> At least it is a convenient way to demonstrate end-to-end use of LCF.
>
> -- Jack Krupansky
>
> --
> From: 
> Sent: Friday, May 28, 2010 5:48 AM
> To: 
> Subject: Proposal for simple LCF 

Re: Proposal for simple LCF deployment model

2010-05-28 Thread Jack Krupansky
But for a basic, early evaluation, "test drive", just the file system and 
web repository connectors should be sufficient. And if there is a clean 
database abstraction, a basic database package (e.g., derby) should be 
sufficient for such a basic evaluation.


Are there technical reasons why third-party repository connectors cannot be 
supported using a Solr-style "plug-in" approach? Or, worst case, as separate 
processes with a clean inter-process API? Maybe not in the near-term, but as 
a longer-term vision.


-- Jack Krupansky

--
From: 
Sent: Friday, May 28, 2010 11:10 AM
To: 
Subject: Re: Proposal for simple LCF deployment model

You forget that building LCF in its entirety requires that you supply 
proprietary client components from third-party vendors.  So I think it is 
unrealistic to expect canned builds that contain everything that you just 
deploy.  For LCF I think the build cycle will thus be very common.


Getting rid of the database requirement is also obviously not an option.

Karl

--- original message ---
From: "ext Jack Krupansky" 
Subject: Re: Proposal for simple LCF deployment model
Date: May 28, 2010
Time: 10:42:17  AM


A simple deployment ala Solr is a good goal. Integrating Jetty with the LCF
deployment will go a long way towards that goal. The database software
deployment (PostgreSQL) is the other half of the hassle with deploying LCF.


I think there are three distinct goals here: 1) A super-easy Solr-style
deployment for initial evaluation of LCF, 2) deployment of the LCF
components for full-blown application development where app server and
database might need to be different from the initial evaluation, and 3)
deployment of LCF components for production deployment of the full
application.

Right now, evaluation of LCF requires deployment of the source code and
building artifacts - Solr evaluation does not require that step. Eliminating
the source and build step will certainly help simplify the evaluation
process.

Another possible consideration is that although some of us are especially
interested in integration with Solr and doing so easily and robustly, Solr
is just one of the output connections and LCF could be deployed for
applications that do not involve Solr at all. So, maybe there should be an
extra deployment wiki page for Solr guys that focuses on use of LCF with
Solr and related issues. Whether that should be the default presentation in
the doc is a matter for debate. Right now, I see no harm with a Solr bias.
At least it is a convenient way to demonstrate end-to-end use of LCF.

-- Jack Krupansky

--
From: 
Sent: Friday, May 28, 2010 5:48 AM
To: 
Subject: Proposal for simple LCF deployment model


The current LCF standard deployment model requires a number of moving
parts, which are probably necessary in some cases, but simply introduce
complexity in others.  It has occurred to me that it may be possible to
provide an alternate deployment model involving Jetty, which would reduce
the number of moving parts by one (by eliminating Tomcat).  A simple LCF
deployment could then, in principle, look pretty much like Solr's.

In order for this to work, the following has to be true:

(1) jetty's basic JSP support must be comparable to Tomcat's.
(2) the class loader that jetty uses for webapp's must provide class
isolation similar to Tomcat's.  If this condition is not met, we'd need 
to build both a Tomcat and a Jetty version of each webapp.

The overall set of changes that would be required would be the following:
(a) An alternative "start" entry point would need to be coded, which 
would start Jetty running the lcf-crawler-ui and lcf-authority-service webapps
before bringing up the agents engine.
(b) The alternative starting point should probably autocreate the
database, and should also autoregister all connectors.  This will require
a list, somewhere, of the connectors and authorities that are included,
and their preferred UI names for that installation.  This could come from
the configuration information, or from some other place.  Any ideas?
(c) There would need to be an additional jar produced by the build process,
which would be the equivalent of the solr start.jar, so as to make running
the whole stack trivial.
(d) An "LCF API" web application, which provides access to all of the
current LCF commands, would also be an obvious requirement to go forward
with this model.

What are the disadvantages?  Well, I think that the main problem would be
security.  This deployment model, though simple, does not control access
to LCF in any way.  You'd need to introduce another moving part to do
that.

Bear in mind that this change would still not allow LCF to run using only
one process.  There are still separate RMI-based processes needed for 
some connectors (Documentum and FileNet).  Al

Re: Proposal for simple LCF deployment model

2010-05-28 Thread karl.wright
You forget that building LCF in its entirety requires that you supply 
proprietary client components from third-party vendors.  So I think it is 
unrealistic to expect canned builds that contain everything that you just 
deploy.  For LCF I think the build cycle will thus be very common.

Getting rid of the database requirement is also obviously not an option.

Karl

--- original message ---
From: "ext Jack Krupansky" 
Subject: Re: Proposal for simple LCF deployment model
Date: May 28, 2010
Time: 10:42:17  AM


A simple deployment ala Solr is a good goal. Integrating Jetty with the LCF
deployment will go a long way towards that goal. The database software
deployment (PostgreSQL) is the other half of the hassle with deploying LCF.

I think there are three distinct goals here: 1) A super-easy Solr-style
deployment for initial evaluation of LCF, 2) deployment of the LCF
components for full-blown application development where app server and
database might need to be different from the initial evaluation, and 3)
deployment of LCF components for production deployment of the full
application.

Right now, evaluation of LCF requires deployment of the source code and
building artifacts - Solr evaluation does not require that step. Eliminating
the source and build step will certainly help simplify the evaluation
process.

Another possible consideration is that although some of us are especially
interested in integration with Solr and doing so easily and robustly, Solr
is just one of the output connections and LCF could be deployed for
applications that do not involve Solr at all. So, maybe there should be an
extra deployment wiki page for Solr guys that focuses on use of LCF with
Solr and related issues. Whether that should be the default presentation in
the doc is a matter for debate. Right now, I see no harm with a Solr bias.
At least it is a convenient way to demonstrate end-to-end use of LCF.

-- Jack Krupansky

--
From: 
Sent: Friday, May 28, 2010 5:48 AM
To: 
Subject: Proposal for simple LCF deployment model

> The current LCF standard deployment model requires a number of moving
> parts, which are probably necessary in some cases, but simply introduce
> complexity in others.  It has occurred to me that it may be possible to
> provide an alternate deployment model involving Jetty, which would reduce
> the number of moving parts by one (by eliminating Tomcat).  A simple LCF
> deployment could then, in principle, look pretty much like Solr's.
>
> In order for this to work, the following has to be true:
>
> (1) jetty's basic JSP support must be comparable to Tomcat's.
> (2) the class loader that jetty uses for webapp's must provide class
> isolation similar to Tomcat's.  If this condition is not met, we'd need to
> build both a Tomcat and a Jetty version of each webapp.
>
> The overall set of changes that would be required would be the following:
> (a) An alternative "start" entry point would need to be coded, which would
> start Jetty running the lcf-crawler-ui and lcf-authority-service webapps
> before bringing up the agents engine.
> (b) The alternative starting point should probably autocreate the
> database, and should also autoregister all connectors.  This will require
> a list, somewhere, of the connectors and authorities that are included,
> and their preferred UI names for that installation.  This could come from
> the configuration information, or from some other place.  Any ideas?
> (c) There would need to be an additional jar produced by the build process,
> which would be the equivalent of the solr start.jar, so as to make running
> the whole stack trivial.
> (d) An "LCF API" web application, which provides access to all of the
> current LCF commands, would also be an obvious requirement to go forward
> with this model.
>
> What are the disadvantages?  Well, I think that the main problem would be
> security.  This deployment model, though simple, does not control access
> to LCF in any way.  You'd need to introduce another moving part to do
> that.
>
> Bear in mind that this change would still not allow LCF to run using only
> one process.  There are still separate RMI-based processes needed for some
> connectors (Documentum and FileNet).  Although these could in theory be
> started up using Java Activation, a main reason for a separate process in
> Documentum's case is that DFC randomly crashes the JVM under which it
> runs, and thus needs to be independently restarted if and when it dies.
> If anyone has experience with Java Activation and wants to contribute
> their time to develop infrastructure that can deal with that problem,
> please let me know.
>
> Finally, there is no way around the fact that LCF requires a
> well-performing database, whi

Re: Proposal for simple LCF deployment model

2010-05-28 Thread Jack Krupansky
(b) The alternative starting point should probably autocreate the database,
and should also autoregister all connectors.  This will require a list,
somewhere, of the connectors and authorities that are included, and their
preferred UI names for that installation.  This could come from the
configuration information, or from some other place.  Any ideas?


I would like to see two things: 1) A way to request LCF to "dump" all 
configuration parameters, including parameters for all output connections, 
repositories, jobs, et al to an "LCF config file", and 2) The ability to 
start from scratch with a fresh deployment of LCF and feed it that config 
file to then create all of the output connections, repository connections, 
and jobs to match the LCF configuration state desired.


Now, whether that config file is simple XML ala solrconfig.xml can be a 
matter for debate. Whether it is a separate file from the current config 
file can also be a matter for debate.


But, in short, the answer to your question would be that there would be an 
LCF config file (not just the simple keyword/value file that LCF has for 
global configuration settings) to set up the initial output connections, 
repository connections, et al.


Maybe this config file is a little closer to the Solr schema file. I think 
it feels that way. OTOH, the list of registered connectors, as opposed to 
the user-created connections that use those connectors, seems more like Solr 
request handlers that are in solrconfig.xml, so maybe the initial 
"configuration" would be split into two separate files as in Solr. Or, 
maybe, the Solr guys have a better proposal for how they would have managed 
that split in Solr if they had it to do all over again. My preference would 
be one file for the whole configuration.


Another advantage of such a config file is that it is easier for people to 
post problem reports that show exactly how they set up LCF.


-- Jack Krupansky

--
From: 
Sent: Friday, May 28, 2010 5:48 AM
To: 
Subject: Proposal for simple LCF deployment model

The current LCF standard deployment model requires a number of moving 
parts, which are probably necessary in some cases, but simply introduce 
complexity in others.  It has occurred to me that it may be possible to 
provide an alternate deployment model involving Jetty, which would reduce 
the number of moving parts by one (by eliminating Tomcat).  A simple LCF 
deployment could then, in principle, look pretty much like Solr's.


In order for this to work, the following has to be true:

(1) jetty's basic JSP support must be comparable to Tomcat's.
(2) the class loader that jetty uses for webapp's must provide class 
isolation similar to Tomcat's.  If this condition is not met, we'd need to 
build both a Tomcat and a Jetty version of each webapp.


The overall set of changes that would be required would be the following:
(a) An alternative "start" entry point would need to be coded, which would 
start Jetty running the lcf-crawler-ui and lcf-authority-service webapps 
before bringing up the agents engine.
(b) The alternative starting point should probably autocreate the 
database, and should also autoregister all connectors.  This will require 
a list, somewhere, of the connectors and authorities that are included, 
and their preferred UI names for that installation.  This could come from 
the configuration information, or from some other place.  Any ideas?
(c) There would need to be an additional jar produced by the build process, 
which would be the equivalent of the solr start.jar, so as to make running 
the whole stack trivial.
(d) An "LCF API" web application, which provides access to all of the 
current LCF commands, would also be an obvious requirement to go forward 
with this model.


What are the disadvantages?  Well, I think that the main problem would be 
security.  This deployment model, though simple, does not control access 
to LCF in any way.  You'd need to introduce another moving part to do 
that.


Bear in mind that this change would still not allow LCF to run using only 
one process.  There are still separate RMI-based processes needed for some 
connectors (Documentum and FileNet).  Although these could in theory be 
started up using Java Activation, a main reason for a separate process in 
Documentum's case is that DFC randomly crashes the JVM under which it 
runs, and thus needs to be independently restarted if and when it dies. 
If anyone has experience with Java Activation and wants to contribute 
their time to develop infrastructure that can deal with that problem, 
please let me know.


Finally, there is no way around the fact that LCF requires a 
well-performing database, which constitutes an independent moving part of 
its own.  This proposal does nothing to change that at all.


Please note that I'm not proposing that the current model go away, but 
rather that we support both.


Thoughts?
Karl 




Re: Proposal for simple LCF deployment model

2010-05-28 Thread Jack Krupansky
A simple deployment ala Solr is a good goal. Integrating Jetty with the LCF 
deployment will go a long way towards that goal. The database software 
deployment (PostgreSQL) is the other half of the hassle with deploying LCF.


I think there are three distinct goals here: 1) A super-easy Solr-style 
deployment for initial evaluation of LCF, 2) deployment of the LCF 
components for full-blown application development where app server and 
database might need to be different from the initial evaluation, and 3) 
deployment of LCF components for production deployment of the full 
application.


Right now, evaluation of LCF requires deployment of the source code and 
building artifacts - Solr evaluation does not require that step. Eliminating 
the source and build step will certainly help simplify the evaluation 
process.


Another possible consideration is that although some of us are especially 
interested in integration with Solr and doing so easily and robustly, Solr 
is just one of the output connections and LCF could be deployed for 
applications that do not involve Solr at all. So, maybe there should be an 
extra deployment wiki page for Solr guys that focuses on use of LCF with 
Solr and related issues. Whether that should be the default presentation in 
the doc is a matter for debate. Right now, I see no harm with a Solr bias. 
At least it is a convenient way to demonstrate end-to-end use of LCF.


-- Jack Krupansky

--
From: 
Sent: Friday, May 28, 2010 5:48 AM
To: 
Subject: Proposal for simple LCF deployment model

The current LCF standard deployment model requires a number of moving 
parts, which are probably necessary in some cases, but simply introduce 
complexity in others.  It has occurred to me that it may be possible to 
provide an alternate deployment model involving Jetty, which would reduce 
the number of moving parts by one (by eliminating Tomcat).  A simple LCF 
deployment could then, in principle, look pretty much like Solr's.


In order for this to work, the following has to be true:

(1) jetty's basic JSP support must be comparable to Tomcat's.
(2) the class loader that jetty uses for webapp's must provide class 
isolation similar to Tomcat's.  If this condition is not met, we'd need to 
build both a Tomcat and a Jetty version of each webapp.


The overall set of changes that would be required would be the following:
(a) An alternative "start" entry point would need to be coded, which would 
start Jetty running the lcf-crawler-ui and lcf-authority-service webapps 
before bringing up the agents engine.
(b) The alternative starting point should probably autocreate the 
database, and should also autoregister all connectors.  This will require 
a list, somewhere, of the connectors and authorities that are included, 
and their preferred UI names for that installation.  This could come from 
the configuration information, or from some other place.  Any ideas?
(c) There would need to be an additional jar produced by the build process, 
which would be the equivalent of the solr start.jar, so as to make running 
the whole stack trivial.
(d) An "LCF API" web application, which provides access to all of the 
current LCF commands, would also be an obvious requirement to go forward 
with this model.


What are the disadvantages?  Well, I think that the main problem would be 
security.  This deployment model, though simple, does not control access 
to LCF is any way.  You'd need to introduce another moving part to do 
that.


Bear in mind that this change would still not allow LCF to run using only 
one process.  There are still separate RMI-based processes needed for some 
connectors (Documentum and FileNet).  Although these could in theory be 
started up using Java Activation, a main reason for a separate process in 
Documentum's case is that DFC randomly crashes the JVM under which it 
runs, and thus needs to be independently restarted if and when it dies. 
If anyone has experience with Java Activation and wants to contribute 
their time to develop infrastructure that can deal with that problem, 
please let me know.


Finally, there is no way around the fact that LCF requires a 
well-performing database, which constitutes an independent moving part of 
its own.  This proposal does nothing to change that at all.


Please note that I'm not proposing that the current model go away, but 
rather that we support both.


Thoughts?
Karl 




Proposal for simple LCF deployment model

2010-05-28 Thread karl.wright
The current LCF standard deployment model requires a number of moving parts, 
which are probably necessary in some cases, but simply introduce complexity in 
others.  It has occurred to me that it may be possible to provide an alternate 
deployment model involving Jetty, which would reduce the number of moving parts 
by one (by eliminating Tomcat).  A simple LCF deployment could then, in 
principle, look pretty much like Solr's.

In order for this to work, the following has to be true:

(1) Jetty's basic JSP support must be comparable to Tomcat's.
(2) the class loader that Jetty uses for webapps must provide class isolation 
similar to Tomcat's.  If this condition is not met, we'd need to build both a 
Tomcat and a Jetty version of each webapp.

The overall set of changes that would be required would be the following:
(a) An alternative "start" entry point would need to be coded, which would 
start Jetty running the lcf-crawler-ui and lcf-authority-service webapps  
before bringing up the agents engine.
(b) The alternative starting point should probably autocreate the database, and 
should also autoregister all connectors.  This will require a list, somewhere, 
of the connectors and authorities that are included, and their preferred UI 
names for that installation.  This could come from the configuration 
information, or from some other place.  Any ideas?
(c) There would need to be an additional jar produced by the build process, which 
would be the equivalent of the solr start.jar, so as to make running the whole 
stack trivial.
(d) An "LCF API" web application, which provides access to all of the current 
LCF commands, would also be an obvious requirement to go forward with this 
model.

What are the disadvantages?  Well, I think that the main problem would be 
security.  This deployment model, though simple, does not control access to LCF 
in any way.  You'd need to introduce another moving part to do that.

Bear in mind that this change would still not allow LCF to run using only one 
process.  There are still separate RMI-based processes needed for some 
connectors (Documentum and FileNet).  Although these could in theory be started 
up using Java Activation, a main reason for a separate process in Documentum's 
case is that DFC randomly crashes the JVM under which it runs, and thus needs 
to be independently restarted if and when it dies.  If anyone has experience 
with Java Activation and wants to contribute their time to develop 
infrastructure that can deal with that problem, please let me know.
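The simplest infrastructure along these lines would be a watchdog loop that
relaunches the side process when it exits. The sketch below assumes a plain
side-process jar; the class name is a placeholder, not an actual LCF artifact:

```java
import java.io.IOException;

// Watchdog sketch: keep a connector side-process (e.g. an RMI server whose
// JVM can be crashed by DFC) running by relaunching it whenever it exits.
public class ProcessWatchdog {

    // Launch cmd and relaunch it each time it exits, performing at most
    // maxRestarts relaunches.  Returns the total number of launches.
    public static int supervise(String[] cmd, int maxRestarts)
            throws IOException, InterruptedException {
        int launches = 0;
        while (launches <= maxRestarts) {
            Process p = new ProcessBuilder(cmd).inheritIO().start();
            launches++;
            int exit = p.waitFor();   // blocks until the side process dies
            System.err.println("side process exited with status " + exit);
            Thread.sleep(1000);       // brief backoff before relaunching
        }
        return launches;
    }
}
```

In production this loop would run in its own thread with an effectively
unbounded restart budget, e.g. supervise(new String[]{"java", "-cp",
"connector-server.jar", "org.example.RmiServer"}, Integer.MAX_VALUE), with
the placeholder command line replaced by the real side-process invocation.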

Finally, there is no way around the fact that LCF requires a well-performing 
database, which constitutes an independent moving part of its own.  This 
proposal does nothing to change that at all.

Please note that I'm not proposing that the current model go away, but rather 
that we support both.

Thoughts?
Karl