Hello Matt,

Thank you so much for your detailed reply, I really appreciate it!
Yes, in the future we might have something in the content that can decide whether there will be a new partition. As of now, however, it is simply the date, so whenever the date changes, it should be triggered. In the present scenario, whenever the date changes, PutHDFS automatically creates a new directory and puts the data there. But can we have logic that identifies this date change and triggers a new processor to run?

Thank you!

______________________
*Kind Regards,*
*Anshuman Ghosh*
*Contact - +49 179 9090964*

On Thu, Mar 23, 2017 at 4:43 PM, Matt Burgess <[email protected]> wrote:

> Anshuman,
>
> For #1, is there a way from the content of a file destined for HDFS
> that you can tell whether a new partition will be introduced? If so,
> then after PutHDFS, you could make that decision (with a
> RouteOnContent or ExtractText->RouteOnAttribute), and for new
> partitions, you can route them to an ExecuteStreamCommand where you
> ignore STDIN and instead issue your hive repair command. This will
> make things more event-driven rather than having to periodically try a
> repair from a separate part of the flow.
>
> For #2, we chose a separate set of Hive processors (rather than adding
> support for the Hive driver to the generic SQL processors) for at
> least two reasons. First, we wanted to include the Hive driver so the
> admin/user did not need to provide their own. The Hive dependencies
> are quite large and don't really belong in the standard NAR, so that
> lent itself to the need for a new NAR. Another reason is that the Hive
> JDBC driver does not support some methods that were deemed important
> and necessary for processors like ExecuteSQL, such as
> setQueryTimeout(). Rather than ignoring the provided value, the Hive
> driver throws a SQLException. Discussion in the community indicated
> that it was better (along with the first reason) to have separate
> processors rather than making the SQL processors behave inconsistently
> based on the level of driver support.
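The date-change check Anshuman asks about (and that Matt suggests doing with ExtractText -> RouteOnAttribute) can be sketched as a small piece of state-comparison logic. This is only an illustration of the routing decision, not NiFi itself; the state-file handling and the `new-partition`/`same-partition` labels are made up for this sketch.

```shell
# Illustrative sketch of the "did the partition date change?" decision.
# In the real flow, the date would be an attribute extracted from the
# flow file and the routing would be done by RouteOnAttribute.
STATE_FILE="$(mktemp)"            # stands in for persisted state

check_new_partition() {
    today="$1"                    # date extracted from the incoming data
    last="$(cat "$STATE_FILE" 2>/dev/null)"
    if [ "$today" != "$last" ]; then
        printf '%s' "$today" > "$STATE_FILE"
        echo "new-partition"      # route to the repair branch
    else
        echo "same-partition"     # route onward unchanged
    fi
}

check_new_partition 2017-03-22    # prints: new-partition
check_new_partition 2017-03-22    # prints: same-partition
check_new_partition 2017-03-23    # prints: new-partition
```

The point of the sketch is that only the first file of a new date takes the repair branch; everything else flows through untouched, which is what makes Matt's approach event-driven.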
> Having said that, there is at least one Pull Request out there [1]
> that does exactly that, so perhaps it is time to revisit the
> discussion to see if there has been any change in opinions.
>
> Regards,
> Matt
>
> [1] https://github.com/apache/nifi/pull/1281
>
> On Thu, Mar 23, 2017 at 5:47 AM, Anshuman Ghosh
> <[email protected]> wrote:
> > Hello Matt,
> >
> > Thank you for your reply!
> >
> > With "ExecuteProcess", I am able to execute a command (Hive/Beeline).
> > Actually our use case is relatively simple, so if you have any other
> > suggestions that would be helpful.
> >
> > We are writing to a HDFS directory which is the location for an
> > external Hive table. However, when we write/introduce a new partition,
> > we need to execute a repair in order to update the metadata. I was
> > wondering if there is any better way to do this (apart from executing
> > the command separately through a processor, where we are also not sure
> > about the frequency of execution) which you are aware of.
> > Can't we use the generic JDBC driver to connect to Hive and execute
> > commands like we do for any other database (like we did for PostgreSQL)?
> >
> > Thanking you in advance!
> >
> > ______________________
> >
> > Kind Regards,
> > Anshuman Ghosh
> > Contact - +49 179 9090964
> >
> > On Wed, Mar 22, 2017 at 6:27 PM, Matt Burgess <[email protected]> wrote:
> >>
> >> Anshuman,
> >>
> >> According to [1], it looks like CDH 5.10 also uses an Apache Hive
> >> 1.1.0 baseline, and looking through the changes [2] I didn't see
> >> anything related to the client_protocol field being added. You are
> >> right that ExecuteProcess should also work with a beeline command; the
> >> major difference is that ExecuteProcess does not accept an incoming
> >> flow file and ExecuteStreamCommand does.
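The "repair" the thread keeps referring to is the statement that registers newly created HDFS directories as partitions of the external table. A minimal sketch, assuming a hypothetical table name; the actual beeline call is left as a comment because it needs a live HiveServer2:

```shell
# Sketch of the repair step to run from ExecuteStreamCommand (ignoring
# STDIN) when a new partition is detected. "my_external_table" and the
# JDBC URL are placeholders, not names from the thread.
TABLE="my_external_table"
REPAIR_SQL="MSCK REPAIR TABLE ${TABLE};"
echo "$REPAIR_SQL"
# In the real flow this string would be handed to beeline, e.g.:
# beeline -u "jdbc:hive2://hive-host:10000/default" -e "$REPAIR_SQL"
```

Because the repair is idempotent, routing only new-partition flow files to it keeps metastore load low without risking missed partitions.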
> >> One thing I should mention: if your Hive query/statement is going
> >> to generate a lot of output (due to a long-running MapReduce job,
> >> for example), you may want to use the --silent command line option
> >> to suppress the output. Otherwise the ExecuteProcess and/or
> >> ExecuteStreamCommand processors have been known to hang on large
> >> outputs.
> >>
> >> Regards,
> >> Matt
> >>
> >> [1] https://www.cloudera.com/documentation/enterprise/release-notes/topics/cdh_vd_cdh_package_tarball_510.html
> >> [2] http://archive.cloudera.com/cdh5/cdh/5/hive-1.1.0-cdh5.10.0.CHANGES.txt?_ga=1.60219309.1838615776.1489495012
> >>
> >> On Wed, Mar 22, 2017 at 12:42 PM, Anshuman Ghosh
> >> <[email protected]> wrote:
> >> > Hello Matt,
> >> >
> >> > Thank you very much for your reply!
> >> > I guess "ExecuteProcess" should also work with a beeline command?
> >> > However, do you know whether CDH 5.10 has a higher Hive version or
> >> > not?
> >> >
> >> > Thanking you in advance!
> >> >
> >> > ______________________
> >> >
> >> > *Kind Regards,*
> >> > *Anshuman Ghosh*
> >> > *Contact - +49 179 9090964*
> >> >
> >> > On Wed, Mar 22, 2017 at 4:43 PM, Matt Burgess <[email protected]>
> >> > wrote:
> >> >
> >> >> Anshuman,
> >> >>
> >> >> The Hive processors use Apache Hive 1.2.0, which is not compatible
> >> >> with Hive 1.1.0 and is thus a known issue against clusters that use
> >> >> Hive 1.1.0 such as CDH 5.9. Unfortunately there were API/code
> >> >> changes between Hive 1.1.0 and Hive 1.2.0, which means there is no
> >> >> simple workaround with respect to the Hive processors. The Hive NAR
> >> >> would have to be rebuilt (and its code changed) to use Hive 1.1.0.
> >> >>
> >> >> One possible workaround is to use ExecuteStreamCommand and the
> >> >> command-line hive client (hive, beeline, etc.) to execute HiveQL
> >> >> statements. This is not ideal but should work for getting the
> >> >> statements executed.
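The command-line workaround Matt describes, with output suppressed as he recommends so the processor does not hang on large results, could look roughly like this. Host, port, and table name are placeholders; the `echo` calls only print the command lines that would run, since no Hive cluster is assumed here.

```shell
# Sketch of silent CLI invocations for ExecuteStreamCommand/ExecuteProcess.
STMT="MSCK REPAIR TABLE my_external_table;"

# Via the legacy hive CLI (-S is its silent flag):
echo hive -S -e "$STMT"

# Via beeline (--silent=true suppresses progress/log output):
echo beeline -u "jdbc:hive2://hive-host:10000/default" --silent=true -e "$STMT"
```

Suppressing output matters here because both processors buffer the command's stdout; a chatty long-running MapReduce job can produce enough output to stall the flow.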
> >> >>
> >> >> Regards,
> >> >> Matt
> >> >>
> >> >> On Wed, Mar 22, 2017 at 11:34 AM, Anshuman Ghosh
> >> >> <[email protected]> wrote:
> >> >> > Hello everyone,
> >> >> >
> >> >> > I am trying to use this "PutHiveQL" processor.
> >> >> > However, no luck with the connection string; it seems like I am
> >> >> > missing out on something.
> >> >> >
> >> >> > I am getting an error like "Required field 'client_protocol' is
> >> >> > unset!"
> >> >> > Please find the attachments for the error message and also the
> >> >> > config property.
> >> >> >
> >> >> > BTW, I am using Hive 1.1.0 which is packaged with CDH 5.9. Can
> >> >> > that be a reason?
> >> >> > What would be the workaround?
> >> >> >
> >> >> > Thanking you in advance!
> >> >> >
> >> >> > ______________________
> >> >> >
> >> >> > Kind Regards,
> >> >> > Anshuman Ghosh
> >> >> > Contact - +49 179 9090964
