Re: Configuration Management at Transformation Connectors

Rafa Haro Wed, 02 Jul 2014 03:28:30 -0700

Hi Karl,

El 02/07/14 12:13, Karl Wright escribió:

Hi Rafa,


First, some more basics.

- Configuration information is about "how" documents are indexed
- Specification information is about "what" documents or metadata is indexed

Configuration is specified as part of the connection definition;
specification is edited as part of the job.  Not all connectors have
configuration information; Tika doesn't, for instance.  Not all connectors
have specification information either.

Good. This is how I already understood it


The pooling is specific to Configuration information.  Connections which
have the same Configuration can be used in multiple jobs and have different
Specification information.  That is why you really should not be creating a
design where there's a large cost penalty in interpreting your connector's
Specification information; it's meant to be relatively light (and so far it
has been, in all connectors I've ever heard of.)  Furthermore, and this is
really really key, the specification information MUST be digestable to a
simple string, because that is how ManifoldCF determines whether changes
have occurred or not from one run to another.  This, too, must be easy and
fast to do, or your connector will be quite slow.

The specification information can be very easily digestable, but in somecases can be very difficult to manage. I will put an example: NamedEntity Recognition models are binary models that you need lo load inmemory in order to use them for extracting entities from documents'content. In terms of configuration, the model is simple a path to afile, so a single string. Also, I could have different models fordifferent jobs of course. But I want to apply the same model to all thedocuments processed in a single job. So, I just need to load in memorythe model only once for each job execution. And my real problem rightnow is that I can't find any proper place in the design of theTransformation Connector to do such unique initialization, because thepath to the model is hosted in the Specification object, but that objectis only passed in the getPipelineDescription method which is called forevery document processed. So, I can probably use the idiom I included inthe last email (check if the model object is null every time in thatmethod) but I was wondering if there is a better proper place to do it


Thanks,
Rafa


Thanks,
Karl



On Wed, Jul 2, 2014 at 4:17 AM, Rafa Haro <[email protected]> wrote:

Hi Karl,

First of all, thanks for your answers. I will read the proposed chapters
but, please, find inline further questions:

El 01/07/14 19:21, Karl Wright escribió:

  Hi Rafa,

Let me answer one question at a time.

bq. I would like to initialize the configuration object only once per job
execution. Because the configuration is not supposed to be changed during
a
job execution, I would like to be able to take the configuration
parameters
from ConfigParams and from Specification objects and create a unique
instance of my configuration object.

Connection instances are all pooled and reused.  You need to read about
their lifetime.  ManifoldCF in Action chapter 6 (IIRC) is where you will
find this: https://manifoldcfinaction.googlecode.com/svn/trunk/pdfs/
You should also be aware that there is *no* prohibition on configuration
or
specification changing during a job run; the framework is structured,
however, so that you don't need to worry about this when writing your
connector.

I understand this Karl. And precisely because of the pooling, it is hard
for me to believe that, during a job execution, the system is able to stop
all the threads and to freeze the execution for initializing again all the
connectors instances in the pool if the user changes the configuration. If
this not actually happen, then for example in implementations like current
Tika Extractor, the getPipelineDescription method will be returning always
exactly the same output version for all the crawled documents in the
current job. I understand the need to check the output version from job to
job, but not per document in a single job.

Also, there is something I'm completely missing: what is the output
specification and which is the difference with connectors and jobs
configuration?

bq. The getPipelineDescription method is quite confusing for me...

Getting a version string and indexing a document may well be separated in
time, and since it is possible for things to change in-between, the
version
string should be the basis of decisions your connector is making about how
to do things.  The version string is what gets actually stored in the DB,
so any differences will be picked up on later crawls.

FWIW, the IRepositoryConnnector interface predates the decision to not
include a document specification for every method call, and that has
persisted for backwards compatibility reasons, although in MCF 2.0 that
may
change.  The current design enforces proper connector coding.

bq. In the addOrReplaceDocumentWithExcept
ion, why is the pipelineDescription passed by parameter instead of the
connector Specification...?

See answer above.


bq. Is there a way to reuse a single configuration object per job
execution? In the Output processor connector, I used to initialize my
custom stuff in the connect method (I'm not sure if this strategy is valid
anyway), but for the Transformation connectors I'm not even sure if this
method is called.

You really aren't supposed to have a *single* object, but rather one per
connection instance.  Connection instances are long-lived, remember.  That
object should also expire eventually if there is no use.  There's a
particular design pattern you should try to adhere to, which is to have a
getSession() method that sets up your long-lived member object, and have
the poll() method free it after a certain amount of inactivity.  Pretty
much all connectors these days use this pattern; for a modern
implementation, have a look at the Jira connector.

Yes yes, of course. There would be a configuration object bounded to each
connector instance, of course. The problem I'm facing is that I want to
create this object only one time (could be perfectly a member of the
connector) and I can't find a proper way/place to do it because I need both
the configuration of the connector (ConfigParams, which is always
available, so that is fine) and the Specification object (which seems to
contain the job configuration data), which as far as I know is only passed
in the getPipelineDescriptionMethod. I would like not to do the
initialization in that method because it is called for each processed
documents and I would like to avoid typical hack like "if(customConfig ==
null) customConfig = new CustomConfig(params, specification);"

FWIW, there's no MCF in Action chapter on transformation connectors yet,
but they are quite similar to output connectors in many respects, so
reading Chapter 9 may help a bit.

Thanks,
Karl

Thanks to you Karl,

Cheers,
Rafa



On Tue, Jul 1, 2014 at 1:04 PM, Rafa Haro <[email protected]> wrote:

  Hi guys,

I'm trying to develop my first Transformation Connector. Before starting
to code, I have tried to read first enough documentation and I have also
studied the Tika extractor as transformation connector example.
Currently,
I'm just trying to implement an initial version of my connector, starting
with something simple to later complicate the things a little bit. The
first problem I'm facing is the configuration management, where I'm
probably missing something. In my case, I need a fixed configuration
while
creating an instance of the connector and a extended configuration per
job.
Let's say that the connector configuration has to setup a service and the
job configuration will define how the service should work for each job.
With both configurations, I need to create an object which is expensive
to
instantiate. Here is where the doubts raise:

1. I would like to initialize the configuration object only once per job
execution. Because the configuration is not supposed to be changed
during a
job execution, I would like to be able to take the configuration
parameters
from ConfigParams and from Specification objects and create a unique
instance of my configuration object.

2. The getPipelineDescription method is quite confusing for me. In the
Tika Extractor, this method is used to pack in a string the configuration
of the Tika processor. Then this string is again unpacked in the
addOrReplaceDocumentWithException method to read the documentation. My
question is why?. As far as I understand, the configuration can't change
during the job execution and according to the documentation "the contents
of the document cannot be considered by this method, and that a different
version string (defined in IRepositoryConnector) is used to describe the
version of the actual document". So, if only configuration data can be
used
to create the output version string, probably this version string can be
checked by the system before starting the job and not produced and
checked
per document because basically all the documents are going to produce the
same exact output version string. Probably I'm missing something but, for
example, looking at Tika Transformation connector seems to be pretty
clear
that there would be no difference between output version strings for all
the documents because it is using only configuration data to create the
string.

3.In the addOrReplaceDocumentWithException, why is the
pipelineDescription passed by parameter instead of the connector
Specification to ease the developer to access the configuration without
marshalling and unmarshalling it?

4. Is there a way to reuse a single configuration object per job
execution? In the Output processor connector, I used to initialize my
custom stuff in the connect method (I'm not sure if this strategy is
valid
anyway), but for the Transformation connectors I'm not even sure if this
method is called.

Thanks a lot for your help beforehand. Please note that the questions of
course are not intended to be criticism. This mail is just a dump of
doubts
that probably will help me to better understand the workflows in manifold

Re: Configuration Management at Transformation Connectors

Reply via email to