> One of the ideas over DIH discussed earlier is making it standalone.

Yeah; my beef with the DIH is that it’s tied to Solr.  But I’d rather see
something other than the DIH outside Solr; it’s not worthy IMO.  Why have
something Solr specific even?  A great pipeline shouldn’t tie itself to any
end-point.  There are a variety of solutions out there that I tried.  There
are the big 3 open-source ETLs (Kettle, Clover, Talend) and they aren’t
quite ideal in one way or another.  And Spring-Integration.  And some
half-baked data pipelines like OpenPipe & Open Pipeline.  I never got
around to taking a good look at Findwise’s open-sourced Hydra but I learned
enough to know to my surprise it was configured in code versus a config
file (like all the others) and that's a big turn-off to me.  Today I read
through most of the Morphlines docs and a few choice source files and I’m
super-impressed.  But as you note it’s missing a lot of other stuff.  I
think something great could be built using it as a core piece.
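
To make the "config file, no fixed end-point" idea concrete, here is a
minimal sketch in Python. It is not Morphlines or any of the tools above;
the command names (dropIfMissing, rename, lowercase), the REGISTRY and the
CONFIG layout are all made up for illustration. The point is only that the
command chain is built from plain data and the sink is just a callable, so
the same chain could feed Solr or anything else.

```python
from typing import Callable, Dict, Iterable, List, Optional

Record = Dict[str, object]
# A command transforms a record, or returns None to drop it.
Command = Callable[[Record], Optional[Record]]


def rename(src: str, dst: str) -> Command:
    """Hypothetical command: move the value of field `src` to field `dst`."""
    def run(rec: Record) -> Optional[Record]:
        out = dict(rec)
        if src in out:
            out[dst] = out.pop(src)
        return out
    return run


def lowercase(field: str) -> Command:
    """Hypothetical command: lowercase a string field if it is present."""
    def run(rec: Record) -> Optional[Record]:
        out = dict(rec)
        value = out.get(field)
        if isinstance(value, str):
            out[field] = value.lower()
        return out
    return run


def drop_if_missing(field: str) -> Command:
    """Hypothetical command: drop records that lack `field`."""
    def run(rec: Record) -> Optional[Record]:
        return rec if field in rec else None
    return run


# Maps names used in the config to command factories (illustrative only).
REGISTRY = {
    "dropIfMissing": drop_if_missing,
    "rename": rename,
    "lowercase": lowercase,
}


def build_pipeline(config: List[dict]) -> List[Command]:
    """Turn a declarative config (plain data) into a chain of commands."""
    return [REGISTRY[step["command"]](**step.get("args", {})) for step in config]


def run_pipeline(source: Iterable[Record],
                 commands: List[Command],
                 sink: Callable[[Record], None]) -> int:
    """Push every source record through the chain; hand survivors to `sink`."""
    delivered = 0
    for rec in source:
        current: Optional[Record] = rec
        for cmd in commands:
            current = cmd(current)
            if current is None:  # a command dropped the record
                break
        if current is not None:
            sink(current)
            delivered += 1
    return delivered


# The "config file": plain data that could equally be loaded from JSON or YAML.
CONFIG = [
    {"command": "dropIfMissing", "args": {"field": "id"}},
    {"command": "rename", "args": {"src": "name", "dst": "title"}},
    {"command": "lowercase", "args": {"field": "title"}},
]

rows = [{"id": 1, "name": "Solr In Action"}, {"name": "no id"}]
collected: List[Record] = []
count = run_pipeline(rows, build_pipeline(CONFIG), collected.append)
```

Here the sink is a list's append; swapping in a function that posts to a
search server would not touch the chain or its config.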

~ David Smiley
Freelance Apache Lucene/Solr Search Consultant/Developer
http://www.linkedin.com/in/davidwsmiley


On Sun, Jun 8, 2014 at 5:51 PM, Mikhail Khludnev <[email protected]> wrote:

> Jack,
> I found your considerations quite reasonable.
> One of the ideas over DIH discussed earlier is making it standalone. So,
> if we start from a simple Morphline UI, we can do this extraction. Then such
> an externalized ETL will work better with Solr Cloud than DIH does now.
> Presumably we can reuse DIH Jdbc Datasources as a source for Morphline
> records.
> The questions still open with this approach are:
> - joins/caching - seems possible with Morphlines, but there is no such
> command yet
> - delta import - a scenario we must not forget to handle
> - threads (completely outside Morphline's concerns)
> - distributed processing - it would be great if we could partition the
> datasource, e.g. the way Sqoop does
> ... what else?
>
>
> On Sun, Jun 8, 2014 at 6:54 PM, Jack Krupansky <[email protected]>
> wrote:
>
>> I've avoided DIH like the plague since it really doesn't fit well in
>> Solr, so I'm still baffled as to why you think we need to use DIH as the
>> foundation for a Solr Morphlines project. That shouldn't stop you, but
>> what's the big impediment to taking a clean slate approach to Morphlines -
>> learn what we can from DIH, but do a fresh, clean "Solr 5.0" implementation
>> that is not burdened from the get-go with all of DIH's baggage?
>>
>> Configuring DIH is one of its main problems, so blending Morphlines
>> config into DIH config would seem to just make Morphlines less attractive
>> than it actually is when viewed by itself.
>>
>> You might also consider how ManifoldCF (another Apache project) would
>> integrate with DIH and Morphlines as well. I mean, the core use case is ETL
>> from external data sources. And how all of this relates to Apache Flume as
>> well.
>>
>> But back to the original, still unanswered, question: Why use DIH as the
>> starting point for integrating Morphlines with Solr - unless the goal is to
>> make Morphlines unpalatable and less approachable than even DIH itself?!
>>
>> Another question: What does Elasticsearch have in this area (besides
>> "rivers")? Are they headed in the Morphlines direction as well?
>>
>>
>> -- Jack Krupansky
>>
>> -----Original Message----- From: Alexandre Rafalovitch
>> Sent: Sunday, June 8, 2014 10:16 AM
>>
>> To: [email protected]
>> Subject: Re: Adding Morphline support to DIH - worth the effort?
>>
>> I see DIH as something that offers a quick way to get things done, as
>> long as they fit into DIH's couple of basic scenarios. Going even a
>> little beyond hits bugs, bad documentation, inconsistencies and lack
>> of ongoing support (e.g. SOLR-4383).
>>
>> So, if it works for you - great. If it does not - too bad, use SolrJ.
>> And given what I observe, I believe the next round of improvements
>> might be easier to achieve by moving to a different open-source pipe
>> project than trying to keep reinventing and bandaging one of our own.
>> Go where strongest community is, etc.
>>
>> Morphline can be seen as a replacement for DIH's EntityProcessors and
>> Transformers (Flume adds other bits). The reasons I think it is worth
>> looking at are as follows:
>> 1) DIH is not really being maintained or further improved. So, the
>> list of EP and Transformers is the same and does not account for new
>> requests (which we see periodically on the mailing list); even the new
>> implementations get stuck in JIRA (see the JIRA in original email)
>> 2) It's not terribly well documented either, so people are always
>> struggling to understand how the entity is actually generated and what
>> happens when things go wrong
>> 3) We are already bundling Morphline jars with Solr. But we are NOT
>> using them in any way useful to a non-Hadoop Solr user. Which begs the
>> question: why did we add them (one answer, I guess: because we don't
>> have a module system).
>> 4) Morphlines have more primitives than DIH and the available list keeps
>> growing
>> 5) What separate module for Solr? We have no discovery method for
>> modules. Writing one for general consumption is like trying to sing in
>> a vacuum - the problem is a lot bigger than with an individual offering.
>>
>> In terms of implementation, I think it would take defining a custom
>> MorphlineEntityProcessor which basically plugs into DIH's current
>> DataSources. So, one could use for example DIH SqlDataSource to get a
>> list of files and then hand off to Morphline's black box to parse
>> those files into records (e.g. Multiline records), augment them, etc.
>> Then, at the end, this gets handed back to DIH to finish it up. I
>> think this would work even with nested entities and transformers. The
>> Admin UI should also work.
>>
>> Eventually, I think we need a harder discussion about DIH, so this
>> partial handover could be a way to test the waters.
>>
>> Does this make more sense?
>>
>> Regards,
>>   Alex.
>> Personal website: http://www.outerthoughts.com/
>> Current project: http://www.solr-start.com/ - Accelerating your Solr
>> proficiency
>>
>>
>> On Sun, Jun 8, 2014 at 8:41 PM, Jack Krupansky <[email protected]>
>> wrote:
>>
>>> It sounds more like an alternative to DIH rather than an incremental
>>> add-on to DIH. I mean, isn't Morphline really just "a DIH for Hadoop"?
>>>
>>> So, back to Shalin's question, which specific (please detail!) use cases
>>> of DIH are enhanced by Morphline?
>>>
>>> Maybe it would help if you simply elaborate what benefits would accrue to
>>> adding Morphline to DIH - as opposed to creating a separate module for
>>> Solr. I suppose it depends on whether you consider DIH a solid foundation
>>> or a weak link in Solr that desperately needs firming up.
>>>
>>> -- Jack Krupansky
>>>
>>> -----Original Message----- From: Alexandre Rafalovitch
>>> Sent: Sunday, June 8, 2014 1:40 AM
>>> To: [email protected]
>>> Subject: Re: Adding Morphline support to DIH - worth the effort?
>>>
>>>
>>> Well, it's the same core scenario as DIH supports (apart from actual
>>> data sources), but actively supported and developed by a company with
>>> a lot more investment in it. For the primitives supported, see
>>> http://cloudera.github.io/cdk/docs/current/cdk-morphlines/morphlinesReferenceGuide.html
>>>
>>> We don't bundle ALL of these with Solr, but I think we do bundle core,
>>> solr-core and solr-cell packages, which is a good number and range of
>>> functionality (e.g. readMultiLine).
>>>
>>> Regards,
>>>   Alex.
>>> Personal website: http://www.outerthoughts.com/
>>> Current project: http://www.solr-start.com/ - Accelerating your Solr
>>> proficiency
>>>
>>>
>>> On Sun, Jun 8, 2014 at 12:23 PM, Shalin Shekhar Mangar
>>> <[email protected]> wrote:
>>>
>>>>
>>>> I do not know much about morphlines but I'd like to know what use-cases
>>>> would be possible/easier/faster with such an integration?
>>>>
>>>>
>>>> On Sun, Jun 8, 2014 at 10:32 AM, Alexandre Rafalovitch
>>>> <[email protected]> wrote:
>>>>
>>>>>
>>>>>
>>>>> Hello,
>>>>>
>>>>> I had a preliminary look around and it might be possible to plug
>>>>> Morphline (already shipped with Solr) into DIH by creating a bridging
>>>>> EntityProcessor.
>>>>>
>>>>> Two questions:
>>>>> 1) Do people see value in it?
>>>>> 2) DIH is not very well supported, so any addition seems to be a bit stuck
>>>>> in "rickety bridge, don't rock" discussion (e.g. SOLR-4799). I don't
>>>>> want to suddenly be responsible for fixing the bridge before adding a
>>>>> standalone piece of code. So, if I write the code, how many general
>>>>> DIH externalities would I also have to address (e.g. lack of tests,
>>>>> etc)?
>>>>>
>>>>> Regards,
>>>>>    Alex.
>>>>> P.S. Morphline could also be integrated into the update request processor
>>>>> chain. So, that could be an alternative project.
>>>>>
>>>>> Personal website: http://www.outerthoughts.com/
>>>>> Current project: http://www.solr-start.com/ - Accelerating your Solr
>>>>> proficiency
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: [email protected]
>>>>> For additional commands, e-mail: [email protected]
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Regards,
>>>> Shalin Shekhar Mangar.
>>>>
>>>
>>>
>>>
>>>
>>
>>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Principal Engineer,
> Grid Dynamics
>
> <http://www.griddynamics.com>
>  <[email protected]>
>
