[jira] [Commented] (HIVE-6332) HCatConstants Documentation needed
[ https://issues.apache.org/jira/browse/HIVE-6332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14280726#comment-14280726 ]

Sushanth Sowmyan commented on HIVE-6332:

Thanks, and sorry for the delay on this - I was swamped at that time, and then this fell off my radar for a long while.

HCatConstants Documentation needed
----------------------------------
Key: HIVE-6332
URL: https://issues.apache.org/jira/browse/HIVE-6332
Project: Hive
Issue Type: Task
Reporter: Sushanth Sowmyan
Assignee: Sushanth Sowmyan

HCatConstants documentation is near non-existent, being defined only as comments in code for the various parameters. Given that a lot of api winds up being implemented as knobs that can be tweaked here, we should have a public facing doc for this.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-6332) HCatConstants Documentation needed
[ https://issues.apache.org/jira/browse/HIVE-6332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14278534#comment-14278534 ]

Lefty Leverenz commented on HIVE-6332:

+1

[~sushanth] finished the documentation in the Setup section of HCatalog Config Properties in the wiki, so this issue can be resolved as fixed.

* [HCatalog Config Properties -- Setup | https://cwiki.apache.org/confluence/display/Hive/HCatalog+Config+Properties#HCatalogConfigProperties-Setup:]
[jira] [Commented] (HIVE-6332) HCatConstants Documentation needed
[ https://issues.apache.org/jira/browse/HIVE-6332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277837#comment-14277837 ]

Sushanth Sowmyan commented on HIVE-6332:

Re: hcat.dynamic.partitioning.custom.pattern - That is set on the JobConf by users of HCatOutputFormat. That is, it is a job-level config for users of HCatOutputFormat: it is set either in a Pig script as a parameter when instantiating HCatStorer, or by a MapReduce user of HCatOutputFormat who provides a JobConf object when instantiating HCatOutputFormat in their own code. It would not affect a typical Hive user.
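As an illustration of what the pattern does once it has been set, the draft posted on this issue gives /ext/geo/${country}-${state} as an example that yields /ext/geo/US-CA and /ext/geo/IN-KA. The following is a minimal Python sketch of that substitution, not HCatalog's implementation. The function name interpolate_partition_path is hypothetical, and the sketch assumes that every ${key} in the pattern is replaced by the value of the partition key with the same name.

```python
import re


def interpolate_partition_path(pattern, partition_values):
    """Replace each ${key} in `pattern` with the matching partition value.

    Hypothetical helper that only models the behaviour described in the
    draft; it is not part of HCatalog.
    """
    def substitute(match):
        key = match.group(1)
        if key not in partition_values:
            raise KeyError(f"pattern refers to unknown partition key: {key}")
        return partition_values[key]

    # Matches ${name} and captures the name between the braces.
    return re.sub(r"\$\{([^}]+)\}", substitute, pattern)


pattern = "/ext/geo/${country}-${state}"
partitions = (
    {"country": "US", "state": "CA"},
    {"country": "IN", "state": "KA"},
)
for values in partitions:
    print(interpolate_partition_path(pattern, values))
# prints /ext/geo/US-CA and then /ext/geo/IN-KA
```

Raising an error for an unknown key is a choice made for this sketch only; the thread does not say how HCatalog treats a pattern variable that is not a partition key.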
[jira] [Commented] (HIVE-6332) HCatConstants Documentation needed
[ https://issues.apache.org/jira/browse/HIVE-6332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14189900#comment-14189900 ]

Lefty Leverenz commented on HIVE-6332:

[~sushanth], is this documentation finished or does it need more work? (See two comments back.) The only urgency is my desire to whittle down my to-do list.
[jira] [Commented] (HIVE-6332) HCatConstants Documentation needed
[ https://issues.apache.org/jira/browse/HIVE-6332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14176075#comment-14176075 ]

Lefty Leverenz commented on HIVE-6332:

Pinging [~sushanth] and myself: This documentation might be finished, or might need final polish. Let's wrap it up and close the jira.
[jira] [Commented] (HIVE-6332) HCatConstants Documentation needed
[ https://issues.apache.org/jira/browse/HIVE-6332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13959704#comment-13959704 ]

Lefty Leverenz commented on HIVE-6332:

[~sushanth], you could flesh out the introduction with instructions on how/where/when to set these properties. If they shouldn't be set by users, you could say they're generally set by administrators. A simple example or two would be helpful.

Right now I'm documenting hcat.dynamic.partitioning.custom.pattern (HIVE-6109) but it isn't much use without information about how to set it. The jira description calls it a job config -- does that mean it can be set for a single CREATE/ALTER TABLE statement? Is that generally true of HCatConstants configs?
[jira] [Commented] (HIVE-6332) HCatConstants Documentation needed
[ https://issues.apache.org/jira/browse/HIVE-6332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922159#comment-13922159 ]

Lefty Leverenz commented on HIVE-6332:

Looks good overall. Of course I have some editorial nits, but they shouldn't clutter up this jira.

One typo you could fix now: hcat.dynamic.partitioning.custom.patttern (triple t).

An introduction would be helpful, mentioning the HCatConstants.java file and explaining basic usage.

Why are cache parameters hcatalog.hive.xxx while all other parameters are hcat.xxx? (I'm asking about hcat vs. hcatalog, not the hive part.)

This sentence in the first section confuses me: "An override to specify where HCatStorer will write to, defined from pig jobs, either directly by user, or by using org.apache.hive.hcatalog.pig.HCatStorerWrapper." Does it mean that Pig jobs specify hcat.pig.storer.external.location? Could you give examples of specifying by user and by HCatStorerWrapper?

In the Data Promotion section, this sentence seems a bit off: "On the write side, it is expected that the user pass in valid HCatRecords with data correctly." Does that mean with data correctly typed for Hive?

That's it for my first pass. I'll take another look later.
[jira] [Commented] (HIVE-6332) HCatConstants Documentation needed
[ https://issues.apache.org/jira/browse/HIVE-6332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13923324#comment-13923324 ]

Sushanth Sowmyan commented on HIVE-6332:

Thanks, Eugene, for the accuracy check, and thanks, Lefty, for the review. :) (More updates yet to come.)
[jira] [Commented] (HIVE-6332) HCatConstants Documentation needed
[ https://issues.apache.org/jira/browse/HIVE-6332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13923322#comment-13923322 ]

Sushanth Sowmyan commented on HIVE-6332:

+ Added wiki link with the original version: https://cwiki.apache.org/confluence/display/Hive/HCatalog+Config+Properties
+ Fixed patttern

Re: HCatConstants.java, I guess the question is whether this is user-facing (user of HCatalog, could be a developer of some other tool that uses HCat) or developer-facing (someone who's developing hive/hcat). With this doc, I went for user-facing, and those users might not need to know about HCatConstants, which I see as internal-facing. I have given it a token introduction though, and am wondering how I can flesh that out a bit better.

Re: hcatalog.hive.xxx vs hcat.xxx: Oh shoot. Good catch. We should have caught that when hcatalog.hive.client.cache.expiry.time was added, but we didn't, and hcatalog.hive.client.cache.disabled, which was added recently, followed that pattern. However, that makes it public API that we should respect. I'll start another bug for updating that and adding a deprecation cycle for it, but we'd get around to removing this only in the hive-0.16 timeframe.

Re: pig section, I'll try to clean that up a bit. HCatStorerWrapper does it automatically for the user if they use it instead of HCatStorer. If they use HCatStorer as is but want to specify a custom location, they need to set that parameter in Pig before invoking the storer. I'll put in some code examples. The Storage directives section in general could use code examples; I'll add them in.

Re: Data promotion, yes, that's what that means. I'll edit it to reflect that better.
[jira] [Commented] (HIVE-6332) HCatConstants Documentation needed
[ https://issues.apache.org/jira/browse/HIVE-6332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13921373#comment-13921373 ]

Sushanth Sowmyan commented on HIVE-6332:

Before I created a wiki page for this, I wanted to have the content checked/reviewed. [~leftylev], [~ekoifman], could you please go through the following and suggest edits/changes? Thanks!
[jira] [Commented] (HIVE-6332) HCatConstants Documentation needed
[ https://issues.apache.org/jira/browse/HIVE-6332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13921374#comment-13921374 ]

Sushanth Sowmyan commented on HIVE-6332:

Ugh, sorry about the formatting above, adding noformat:

{noformat}
HCatalog job properties:

Storage directives:
-------------------

hcat.pig.storer.external.location : An override to specify where HCatStorer will write to, defined from pig jobs, either directly by user, or by using org.apache.hive.hcatalog.pig.HCatStorerWrapper. HCat will write to this specified directory, rather than writing to the table/partition directory specified/calculatable by the metadata. This will be used in lieu of the table directory if this is a table-level write (unpartitioned table write) or in lieu of the partition directory if this is a partition-level write. This parameter is used only for non-dynamic-partitioning jobs which have multiple write destinations.

hcat.dynamic.partitioning.custom.pattern : For dynamic partitioning jobs, simply specifying a custom directory is not good enough, since it writes to multiple destinations, and thus, instead of a directory specification, it requires a pattern specification. That's where this parameter comes in. For example, if one had a table that was partitioned by keys country and state, with a root directory location of /apps/hive/warehouse/geo/ , then a dynamic partition write into it that writes partitions (country=US,state=CA) (country=IN,state=KA) would create two directories: /apps/hive/warehouse/geo/country=US/state=CA/ and /apps/hive/warehouse/geo/country=IN/state=KA/ . If we wanted a different patterned location, and specified hcat.dynamic.partitioning.custom.patttern=/ext/geo/${country}-${state}, it would create the following two partition dirs: /ext/geo/US-CA and /ext/geo/IN-KA . Thus, it allows us to specify a custom dir location pattern for all the writes, and will interpolate each variable it sees when attempting to create a destination location for the partitions.

Cache behaviour directives:
---------------------------

HCatalog maintains a cache of HiveClients to talk to the metastore, managing a cache of 1 metastore client per thread, defaulting to an expiry of 120 seconds. For people that wish to modify the behaviour of this cache, a few parameters are provided:

hcatalog.hive.client.cache.expiry.time : Allows users to override the expiry time specified - this is an int, and specifies number of seconds. Default is 120.

hcatalog.hive.client.cache.disabled : Default is false, allows people to disable the cache altogether if they wish to. This is useful in highly multithreaded usecases.

Input Split Generation Behaviour:
---------------------------------

hcat.desired.partition.num.splits : This is a hint/guidance that can be provided to HCatalog to pass on to underlying InputFormats, to produce a desired number of splits per partition. This is useful when we have a few large files and we want to increase parallelism by increasing the number of splits generated. It is not yet so useful in cases where we would want to reduce the number of splits for a large number of files. It is not at all useful, also, in cases where there are a large number of partitions that this job will read. Also note that this is merely an optimization hint, and it is not guaranteed that the underlying layer will be capable of using this optimization. Also, mapreduce parameters mapred.min.split.size and mapred.max.split.size can be used in conjunction with this parameter to tweak/optimize jobs.

Data Promotion Behaviour:
-------------------------

In some cases where a user of HCat (such as some older versions of pig) does not support all the datatypes supported by hive, there are a few config parameters provided to handle data promotions/conversions to allow them to read data through HCatalog. On the write side, it is expected that the user pass in valid HCatRecords with data correctly.

hcat.data.convert.boolean.to.integer : promotes boolean to int on read from HCatalog, defaults to false.

hcat.data.tiny.small.int.promotion : promotes tinyint/smallint to int on read from HCatalog, defaults to false.

HCatRecordReader Error Tolerance Behaviour:
-------------------------------------------

While reading, it is understandable that data might contain errors, but we may not want to completely abort a task due to a couple of errors. These parameters configure how many errors we can accept before we fail the task.

hcat.input.bad.record.threshold : A float parameter, defaults to 0.0001f, which means we can deal with 1 error every 10,000 rows, and still not error out. Any greater, and we will.

hcat.input.bad.record.min : An int parameter, defaults to 2, which is the minimum number of bad records we encounter before applying hcat.input.bad.record.threshold
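The interplay of the two error tolerance parameters can be easier to follow with numbers. The sketch below is a Python model of the rule as the draft describes it, not the HCatRecordReader code. It assumes that hcat.input.bad.record.min acts as a floor (no failure until that many bad records have been seen) and that the task then fails when the fraction of bad records is strictly greater than hcat.input.bad.record.threshold. The function name and the strictness of the comparison are assumptions.

```python
def should_fail_task(bad_records, total_records, threshold=0.0001, min_bad=2):
    """Return True if the task should be failed under the assumed rule.

    threshold models hcat.input.bad.record.threshold (default 0.0001).
    min_bad models hcat.input.bad.record.min (default 2).
    Hypothetical helper; the real reader may differ in details.
    """
    if total_records <= 0:
        raise ValueError("total_records must be positive")
    if bad_records < min_bad:
        # Too few bad records seen so far; the threshold is not applied yet.
        return False
    return bad_records / total_records > threshold


print(should_fail_task(1, 100))        # False: below the minimum of 2
print(should_fail_task(2, 100))        # True: 2% bad is above 0.01%
print(should_fail_task(2, 1_000_000))  # False: 0.0002% bad is within 0.01%
```

The first call shows why the minimum exists: without it, a single bad record early in a small input would already exceed the ratio and fail the task.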
[jira] [Commented] (HIVE-6332) HCatConstants Documentation needed
[ https://issues.apache.org/jira/browse/HIVE-6332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13921812#comment-13921812 ]

Eugene Koifman commented on HIVE-6332:

The section on Data Promotion looks fine to me.
[jira] [Commented] (HIVE-6332) HCatConstants Documentation needed
[ https://issues.apache.org/jira/browse/HIVE-6332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13902051#comment-13902051 ]

Sushanth Sowmyan commented on HIVE-6332:

Just commenting to note that I wasn't able to get to this last week or this week, but I'll still definitely try to get this in before we fork for 0.13.