Anshuman,

For #1, is there a way to tell, from the content of a file destined for HDFS, whether it will introduce a new partition? If so, then after PutHDFS you could make that decision (with RouteOnContent, or ExtractText followed by RouteOnAttribute) and route flow files for new partitions to an ExecuteStreamCommand that ignores STDIN and instead issues your Hive repair command. That makes the repair event-driven, rather than something you have to attempt periodically from a separate part of the flow.
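[Editor's note] If you go the ExecuteStreamCommand route, the repair statement would typically be MSCK REPAIR TABLE. Below is a minimal, non-authoritative sketch of the command such a processor could issue; the JDBC URL and table name are placeholder assumptions, not values from this thread, and the function only echoes the command so the sketch is safe to run without a cluster:

```shell
# Sketch of the command an ExecuteStreamCommand processor could run after
# PutHDFS routes a "new partition" flow file here. The table name would
# come from a flow file attribute via Command Arguments.
repair_partitions_cmd() {
  table="$1"
  # MSCK REPAIR TABLE rescans the external table's HDFS location and
  # registers any partitions the Hive metastore does not yet know about.
  # --silent=true suppresses progress chatter (see the note later in the
  # thread about the processors hanging on large output).
  # NOTE: the JDBC URL is a placeholder -- substitute your HiveServer2 URL.
  # printf echoes the command instead of executing it; pipe to sh to run.
  printf 'beeline -u jdbc:hive2://hiveserver:10000/default --silent=true -e "MSCK REPAIR TABLE %s;"\n' "$table"
}

repair_partitions_cmd my_events
```

In ExecuteStreamCommand you would set beeline as the Command Path and pass the `-u`/`--silent`/`-e` pieces as Command Arguments, with the table name pulled from an attribute.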
For #2, we chose a separate set of Hive processors (rather than adding Hive driver support to the generic SQL processors) for at least two reasons. First, we wanted to bundle the Hive driver so the admin/user would not need to provide their own; the Hive dependencies are quite large and don't really belong in the standard NAR, which lent itself to the need for a new NAR. Second, the Hive JDBC driver does not support some methods that were deemed important and necessary for processors like ExecuteSQL, such as setQueryTimeout(); rather than ignoring the provided value, the Hive driver throws a SQLException. Discussion in the community indicated that it was better (along with the first reason) to have separate processors than to make the SQL processors behave inconsistently based on the level of driver support. Having said that, there is at least one Pull Request out there [1] that does exactly that, so perhaps it is time to revisit the discussion and see whether opinions have changed.

Regards,
Matt

[1] https://github.com/apache/nifi/pull/1281

On Thu, Mar 23, 2017 at 5:47 AM, Anshuman Ghosh <[email protected]> wrote:
> Hello Matt,
>
> Thank you for your reply!
>
> With "ExecuteProcess", I am able to execute a command (Hive/Beeline).
> Actually our use case is relatively simple, so if you have any other
> suggestions, that would be helpful.
>
> We are writing to an HDFS directory which is the location for an external
> Hive table. However, when we write/introduce a new partition, we need to
> execute a repair in order to update the metadata. I was wondering if there
> is a better way to do this (apart from executing the command separately
> through a processor, and even then we are not sure about the right
> frequency for that execution).
> Can't we use a generic JDBC driver to connect to Hive and execute commands
> like we do for any other database (as we did for PostgreSQL)?
>
> Thanking you in advance!
>
> ______________________
>
> Kind Regards,
> Anshuman Ghosh
> Contact - +49 179 9090964
>
>
> On Wed, Mar 22, 2017 at 6:27 PM, Matt Burgess <[email protected]> wrote:
>>
>> Anshuman,
>>
>> According to [1], it looks like CDH 5.10 also uses an Apache Hive
>> 1.1.0 baseline, and looking through the changes [2] I didn't see
>> anything related to the client_protocol field being added. You are
>> right that ExecuteProcess should also work with a beeline command; the
>> major difference is that ExecuteProcess does not accept an incoming
>> flow file, while ExecuteStreamCommand does. One thing I should mention:
>> if your Hive query/statement is going to generate a lot of output (due
>> to a long-running MapReduce job, for example), you may want to use the
>> --silent command-line option to suppress the output. Otherwise the
>> ExecuteProcess and/or ExecuteStreamCommand processors have been known
>> to hang on large outputs.
>>
>> Regards,
>> Matt
>>
>> [1] https://www.cloudera.com/documentation/enterprise/release-notes/topics/cdh_vd_cdh_package_tarball_510.html
>> [2] http://archive.cloudera.com/cdh5/cdh/5/hive-1.1.0-cdh5.10.0.CHANGES.txt?_ga=1.60219309.1838615776.1489495012
>>
>>
>> On Wed, Mar 22, 2017 at 12:42 PM, Anshuman Ghosh
>> <[email protected]> wrote:
>> > Hello Matt,
>> >
>> > Thank you very much for your reply!
>> > I guess "ExecuteProcess" should also work with a beeline command?
>> > However, do you know whether CDH 5.10 ships a higher Hive version or not?
>> >
>> > Thanking you in advance!
>> >
>> > ______________________
>> >
>> > Kind Regards,
>> > Anshuman Ghosh
>> > Contact - +49 179 9090964
>> >
>> >
>> > On Wed, Mar 22, 2017 at 4:43 PM, Matt Burgess <[email protected]> wrote:
>> >
>> >> Anshuman,
>> >>
>> >> The Hive processors use Apache Hive 1.2.0, which is not compatible
>> >> with Hive 1.1.0 and is thus a known issue against clusters that use
>> >> Hive 1.1.0, such as CDH 5.9.
>> >> Unfortunately there were API/code changes
>> >> between Hive 1.1.0 and Hive 1.2.0, which means there is no simple
>> >> workaround with respect to the Hive processors. The Hive NAR would
>> >> have to be rebuilt (and its code changed) to use Hive 1.1.0.
>> >>
>> >> One possible workaround is to use ExecuteStreamCommand and the
>> >> command-line Hive client (hive, beeline, etc.) to execute HiveQL
>> >> statements. This is not ideal, but it should work for getting the
>> >> statements executed.
>> >>
>> >> Regards,
>> >> Matt
>> >>
>> >>
>> >> On Wed, Mar 22, 2017 at 11:34 AM, Anshuman Ghosh
>> >> <[email protected]> wrote:
>> >> > Hello everyone,
>> >> >
>> >> > I am trying to use this "PutHiveQL" processor.
>> >> > However, no luck with the connection string; it seems I am missing
>> >> > out on something.
>> >> >
>> >> > I am getting an error like "Required field 'client_protocol' is unset!"
>> >> > Please find attached the error message and also the config property.
>> >> >
>> >> > BTW, I am using Hive 1.1.0, which is packaged with CDH 5.9. Can that
>> >> > be a reason? What would be the workaround?
>> >> >
>> >> > Thanking you in advance!
>> >> >
>> >> > ______________________
>> >> >
>> >> > Kind Regards,
>> >> > Anshuman Ghosh
>> >> > Contact - +49 179 9090964
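[Editor's note] For the ExecuteStreamCommand workaround mentioned above, a minimal sketch of the beeline invocation follows. The JDBC URL and query are placeholder assumptions (not values from this thread), and the function only echoes the command line rather than executing it, so the sketch runs without a Hadoop cluster:

```shell
# Hypothetical beeline invocation for ExecuteStreamCommand/ExecuteProcess.
# --silent=true suppresses the MapReduce progress chatter that can make
# those processors hang on large output, per Matt's note in this thread.
build_beeline_cmd() {
  url="$1"
  query="$2"
  # printf echoes the assembled command; remove the indirection and call
  # beeline directly once the URL points at a real HiveServer2 instance.
  printf 'beeline -u %s --silent=true -e "%s"\n' "$url" "$query"
}

build_beeline_cmd "jdbc:hive2://hiveserver:10000/default" "SELECT 1;"
```

Because the Hive CLI talks the wire protocol of whatever client binary is installed on the NiFi host, this sidesteps the Hive 1.1.0 vs. 1.2.0 client_protocol mismatch entirely.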
