I've avoided DIH like the plague since it really doesn't fit well in Solr,
so I'm still baffled as to why you think we need to use DIH as the
foundation for a Solr Morphlines project. That shouldn't stop you, but
what's the big impediment to taking a clean slate approach to Morphlines -
learn what we can from DIH, but do a fresh, clean "Solr 5.0" implementation
that is not burdened from the get-go with all of DIH's baggage?
Configuring DIH is one of its main problems, so blending Morphlines config
into DIH config would seem to just make Morphlines less attractive than it
actually is when viewed by itself.
You might also consider how ManifoldCF (another Apache project) would
integrate with DIH and Morphlines as well. I mean, the core use case is ETL
from external data sources. And how all of this relates to Apache Flume as
well.
But back to the original, still unanswered, question: Why use DIH as the
starting point for integrating Morphlines with Solr - unless the goal is to
make Morphlines unpalatable and less approachable than even DIH itself?!
Another question: What does Elasticsearch have in this area (besides
"rivers")? Are they headed in the Morphlines direction as well?
-- Jack Krupansky
-----Original Message-----
From: Alexandre Rafalovitch
Sent: Sunday, June 8, 2014 10:16 AM
To: [email protected]
Subject: Re: Adding Morphline support to DIH - worth the effort?
I see DIH as something that offers a quick way to get things done, as
long as they fit into DIH's couple of basic scenarios. Going even a
little beyond hits bugs, bad documentation, inconsistencies and lack
of ongoing support (e.g. SOLR-4383).
So, if it works for you - great. If it does not - too bad, use SolrJ.
And given what I observe, I believe the next round of improvements
might be easier to achieve by moving to a different open-source pipe
project than trying to keep reinventing and bandaging one of our own.
Go where the strongest community is, etc.
Morphline can be seen as a replacement for DIH's EntityProcessors and
Transformers (Flume adds other bits). The reasons I think it is worth
looking at are as follows:
1) DIH is not really being maintained or further improved. So, the
list of EPs and Transformers stays the same and does not account for new
requests (which we see periodically on the mailing list); even the new
implementations get stuck in JIRA (see the JIRA in the original email).
2) It's not terribly well documented either, so people are always
struggling to understand how the entity is actually generated and what
happens when things go wrong.
3) We are already bundling Morphline jars with Solr. But we are NOT
using them in any way useful to a non-Hadoop Solr user. Which raises the
question of why we added them (one answer, I guess: because we don't
have a module system).
4) Morphlines have more primitives than DIH and the available list keeps
growing.
5) A separate module for Solr? We have no discovery method for
modules. Writing one for general consumption is like trying to sing in
a vacuum - the problem is a lot bigger than with an individual offering.
In terms of implementation, I think it would take defining a custom
MorphlineEntityProcessor which basically plugs into DIH's current
DataSources. So, one could use, for example, DIH SqlDataSource to get a
list of files and then hand off to Morphline's black box to parse
those files into records (e.g. multiline records), augment them, etc.
Then, at the end, this gets handed back to DIH to finish it up. I
think this would work even with nested entities and transformers. The
Admin UI should also work.
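
The hand-off described above can be sketched roughly as follows. This is
only an illustration of the data flow, not working DIH or Morphline code:
RowSource, RecordPipeline and BridgingEntityProcessor are hypothetical
stand-ins invented for this sketch, the file contents are inlined instead
of being read from disk, and the "pipeline" is a plain line splitter
rather than a real morphline. The only DIH convention it assumes is that
an entity processor hands back one row at a time and returns null when
there are no more rows.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Sketch only: every type here is a hypothetical stand-in, not the DIH or Morphline API.
final class MorphlineBridgeSketch {

    /** Stand-in for a DIH data source, e.g. a SQL query returning one row per file. */
    interface RowSource {
        Iterator<Map<String, Object>> rows();
    }

    /** Stand-in for a compiled morphline: one record in, zero or more records out. */
    interface RecordPipeline {
        List<Map<String, Object>> process(Map<String, Object> input);
    }

    /** Pulls rows from the source, pushes each through the pipeline, returns results one by one. */
    static final class BridgingEntityProcessor {
        private final Iterator<Map<String, Object>> source;
        private final RecordPipeline pipeline;
        private final Queue<Map<String, Object>> pending = new ArrayDeque<>();

        BridgingEntityProcessor(RowSource source, RecordPipeline pipeline) {
            this.source = source.rows();
            this.pipeline = pipeline;
        }

        /** Returns the next output row, or null once source and pipeline are exhausted. */
        Map<String, Object> nextRow() {
            // Refill the buffer until there is something to return or the source runs dry.
            while (pending.isEmpty() && source.hasNext()) {
                pending.addAll(pipeline.process(source.next()));
            }
            return pending.poll();
        }
    }

    /** Builds a small ordered map from alternating key/value arguments. */
    private static Map<String, Object> row(Object... kv) {
        Map<String, Object> m = new LinkedHashMap<>();
        for (int i = 0; i < kv.length; i += 2) {
            m.put((String) kv[i], kv[i + 1]);
        }
        return m;
    }

    /** Runs the bridge over two made-up "files" and collects every row handed back. */
    static List<Map<String, Object>> demo() {
        List<Map<String, Object>> files = new ArrayList<>();
        files.add(row("path", "a.log", "content", "first\nsecond"));
        files.add(row("path", "b.log", "content", "third"));
        RowSource source = files::iterator;

        // Splits the content into lines and augments each record with its source path.
        RecordPipeline splitLines = input -> {
            List<Map<String, Object>> out = new ArrayList<>();
            for (String line : ((String) input.get("content")).split("\n")) {
                out.add(row("path", input.get("path"), "line", line));
            }
            return out;
        };

        BridgingEntityProcessor processor = new BridgingEntityProcessor(source, splitLines);
        List<Map<String, Object>> result = new ArrayList<>();
        for (Map<String, Object> r = processor.nextRow(); r != null; r = processor.nextRow()) {
            result.add(r);
        }
        return result;
    }

    public static void main(String[] args) {
        for (Map<String, Object> r : demo()) {
            System.out.println(r);
        }
    }
}
```

The buffering in nextRow() is the part that matters: one source row (a
file) can fan out into many records, so the bridge has to queue them and
drain the queue before asking the source for the next file.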
Eventually, I think we need a harder discussion about DIH, so this
partial handover could be a way to test the waters.
Does this make more sense?
Regards,
Alex.
Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr
proficiency
On Sun, Jun 8, 2014 at 8:41 PM, Jack Krupansky <[email protected]>
wrote:
It sounds more like an alternative to DIH rather than an incremental add-on
to DIH. I mean, isn't Morphline really just "a DIH for Hadoop"?
So, back to Shalin's question, which specific (please detail!) use cases of
DIH are enhanced by Morphline?
Maybe it would help if you simply elaborate what benefits would accrue to
adding Morphline to DIH - as opposed to creating a separate module for
Solr.
I suppose it depends on whether you consider DIH a solid foundation or a
weak link in Solr that desperately needs firming up.
-- Jack Krupansky
-----Original Message-----
From: Alexandre Rafalovitch
Sent: Sunday, June 8, 2014 1:40 AM
To: [email protected]
Subject: Re: Adding Morphline support to DIH - worth the effort?
Well, it's the same core scenario as DIH supports (apart from actual
data sources), but actively supported and developed by a company with
a lot more investment in it. For the primitives supported, see
http://cloudera.github.io/cdk/docs/current/cdk-morphlines/morphlinesReferenceGuide.html
We don't bundle ALL of these with Solr, but I think we do bundle core,
solr-core and solr-cell packages, which is a good number and range of
functionality (e.g. readMultiLine).
Regards,
Alex.
On Sun, Jun 8, 2014 at 12:23 PM, Shalin Shekhar Mangar
<[email protected]> wrote:
I do not know much about morphlines but I'd like to know what use-cases
would be possible/easier/faster with such an integration?
On Sun, Jun 8, 2014 at 10:32 AM, Alexandre Rafalovitch
<[email protected]>
wrote:
Hello,
I had a preliminary look around and it might be possible to plug
Morphline (already shipped with Solr) into DIH by creating a bridging
EntityProcessor.
Two questions:
1) Do people see value in it?
2) DIH is not well supported, so any addition seems to get stuck
in a "rickety bridge, don't rock" discussion (e.g. SOLR-4799). I don't
want to suddenly be responsible for fixing the bridge before adding a
standalone piece of code. So, if I write the code, how many general
DIH externalities would I also have to address (e.g. lack of tests,
etc.)?
Regards,
Alex.
P.S. Morphline could also be integrated into the update request processor
chain. So, that could be an alternative project.
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
--
Regards,
Shalin Shekhar Mangar.