[jira] [Commented] (HIVE-5020) HCat reading null-key map entries causes NPE

2014-07-02 Thread Daniel Dai (JIRA)

[ https://issues.apache.org/jira/browse/HIVE-5020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14050866#comment-14050866 ]

Daniel Dai commented on HIVE-5020:
--

I found the following comment in LazyMap.java:
// LazyMap stores a map of Primitive LazyObjects to LazyObjects. Note that the
// keys of the map cannot contain null.

This could be why, when I try to load a null map key from an RC file, I end up
in an infinite loop.

To be safe, it seems we should disallow null map keys. Even if we fix LazyMap,
there could be other places where we made the same assumption.
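
To make "disallow null map keys" concrete, here is a minimal sketch of the
kind of up-front guard this would imply. The class and method names are
hypothetical, for illustration only; they are not actual Hive internals:

{noformat}
import java.util.Map;

// Hypothetical validator, for illustration only -- not part of Hive.
public final class MapKeyValidator {

    // Rejects maps containing a null key before they reach code such as
    // LazyMap, which documents that its keys must never be null.
    public static <K, V> Map<K, V> requireNonNullKeys(Map<K, V> map) {
        for (K key : map.keySet()) {
            if (key == null) {
                throw new IllegalArgumentException(
                    "null map keys are not supported by this serde");
            }
        }
        return map;
    }
}
{noformat}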

 HCat reading null-key map entries causes NPE
 

 Key: HIVE-5020
 URL: https://issues.apache.org/jira/browse/HIVE-5020
 Project: Hive
  Issue Type: Bug
  Components: HCatalog
Reporter: Sushanth Sowmyan
Assignee: Sushanth Sowmyan

 Currently, if someone has a null key in a map, HCatInputFormat will terminate 
 with an NPE while trying to read it.
 {noformat}
 java.lang.NullPointerException
   at java.lang.String.compareTo(String.java:1167)
   at java.lang.String.compareTo(String.java:92)
   at java.util.TreeMap.put(TreeMap.java:545)
   at org.apache.hcatalog.data.HCatRecordSerDe.serializeMap(HCatRecordSerDe.java:222)
   at org.apache.hcatalog.data.HCatRecordSerDe.serializeField(HCatRecordSerDe.java:198)
   at org.apache.hcatalog.data.LazyHCatRecord.get(LazyHCatRecord.java:53)
   at org.apache.hcatalog.data.LazyHCatRecord.get(LazyHCatRecord.java:97)
   at org.apache.hcatalog.mapreduce.HCatRecordReader.nextKeyValue(HCatRecordReader.java:203)
 {noformat}
 This is because we use a TreeMap to preserve order of elements in the map 
 when reading from the underlying storage/serde.
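 The failure reproduces trivially outside Hive: a TreeMap with natural
 ordering calls compareTo() on every put(), so a null key fails immediately,
 while an insertion-ordered map accepts it. A minimal standalone demo:
 {noformat}
 import java.util.LinkedHashMap;
 import java.util.Map;
 import java.util.TreeMap;

 public class NullKeyDemo {
     public static void main(String[] args) {
         Map<String, String> ordered = new LinkedHashMap<>();
         ordered.put(null, "value");   // fine: LinkedHashMap permits one null key
         System.out.println(ordered);  // prints {null=value}

         Map<String, String> sorted = new TreeMap<>();
         sorted.put(null, "value");    // throws NullPointerException: natural
                                       // ordering invokes key.compareTo(...)
     }
 }
 {noformat}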
 This problem is easily fixed in a number of ways:
 a) Switch to HashMap, which allows null keys. That does not preserve the
 order of keys, which should not be important for map fields, but if we want
 that too, we have a solution: LinkedHashMap, which would both retain order
 and allow us to insert null keys into the map.
 b) Ignore null-keyed entries: check whether the key we read is null, and if
 it is, ignore that item in the record altogether (see the sketch after this
 list). This way, HCat is robust in what it does: it does not terminate with
 an NPE, and it does not pass null map keys up to layers above us that are not
 used to seeing nulls as keys in maps.
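 A minimal sketch of what option (b) might look like in a serializeMap-style
 copy loop. This is illustrative only, written as a standalone helper; it is
 not the actual HCatRecordSerDe code:
 {noformat}
 import java.util.LinkedHashMap;
 import java.util.Map;

 public final class SerializeMapSketch {
     // Copies map entries into an order-preserving map, dropping null-keyed
     // entries instead of letting TreeMap.put() throw an NPE.
     public static <K, V> Map<K, V> copySkippingNullKeys(Map<K, V> source) {
         Map<K, V> result = new LinkedHashMap<>();
         for (Map.Entry<K, V> e : source.entrySet()) {
             if (e.getKey() == null) {
                 continue;  // option (b): drop the null-keyed entry
             }
             result.put(e.getKey(), e.getValue());
         }
         return result;
     }
 }
 {noformat}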
 Why do I bring up the second fix? First, I bring it up because of the way we
 discovered this bug. When reading from an RCFile, we do not notice this bug.
 If the same query that produced the RCFile instead produces an ORC file, and
 we try reading from it, we see this problem.
 RCFile seems to be quietly stripping any null-key entries, whereas ORC
 retains them. This is why we didn't notice this problem for a long while and
 suddenly now do. If we fix our code to let null map keys pass through to
 layers above, we expose those layers to this change, which may then cause
 them to break. (Technically, this is stretching the case, because we already
 break now if they care.) More importantly, though, we now have a case where
 the same data will be exposed differently depending on whether it was stored
 as ORC or as RCFile. And as a layer that is supposed to make storage
 invisible to the end user, HCat should attempt to provide some consistency in
 how data behaves for the end user.
 Secondly, whether or not nulls should be supported as keys in maps seems to
 be almost a religious question. Some people see it from the perspective of a
 mapping, which lends itself to a "Sure, if we encounter a null, we map it to
 this other value" kind of view, whereas other people see it from a
 lookup-index perspective, which lends itself to a "A null key makes no
 sense - what kind of lookup do you expect to perform?" kind of view. Both
 views have their points, and it makes sense to ask whether we need to
 support it.
 That said...
 There is another important concern at hand here: nulls in map keys might be
 due to bad data (corruption or a loading error), and by stripping them, we
 might be silently hiding that from the user. So silent stripping is bad.
 This is an important point that does steer me towards the former approach:
 passing them on to layers above, and standardizing on an understanding that
 null keys in maps are acceptable data that the layers above us have to
 handle. After that, it could be taken on as a further consistency fix to
 change RCFile so that it allows nulls in map keys.
 Having gone through this discussion of standardization, another important
 question is whether or not there is actually a use case for null keys in
 maps in data. If there isn't, maybe we shouldn't allow writing that in the
 first place, and both ORC and RCFile must simply error out to the end user
 if they try to write a null map key? Well, it is true that data errors can
 lead to null keys, but it's also possible that the user wants to store a
 mapping for value transformations, and they might have a transformation for
 null as well. In the case I encountered, they were writing out an
 intermediate table after having read from a sparse table using a custom
 input format that generated an arbitrary number of columns, and were using
 the map to store column-name mappings that would eventually be written out
 to another table. That seems a valid use, and we shouldn't prevent users
 from this sort of usage.

[jira] [Commented] (HIVE-5020) HCat reading null-key map entries causes NPE

2014-07-01 Thread Sushanth Sowmyan (JIRA)

[ https://issues.apache.org/jira/browse/HIVE-5020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049238#comment-14049238 ]

Sushanth Sowmyan commented on HIVE-5020:


Sorry for the late response to this jira, and thanks for the input, all. I'd 
initially wanted to give it time for more people to respond, and then this fell 
by the wayside.

Thrift structures do not support null map keys. I agree that sortedness is
not important for maps; in fact, we should not guarantee it for something
that's just called a map.

And while I'd like to see a use case for null keys supported, it looks like
the conventional Hive semantics for maps ignore null keys, and changing
things so that RCFile users suddenly start getting null keys is a recipe for
trouble for a lot of users. So having ORC match RCFile behaviour, and making
that the standard Hive behaviour, might make more sense.
[~owen.omalley]/[~prasanth_j], could you comment on what you think the impact
of changing ORC behaviour that way might be?

HCat should adopt whatever behaviour we standardize on for Hive, and can
follow after that.

[jira] [Commented] (HIVE-5020) HCat reading null-key map entries causes NPE

2013-08-14 Thread Alan Gates (JIRA)

[ https://issues.apache.org/jira/browse/HIVE-5020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13740218#comment-13740218 ]

Alan Gates commented on HIVE-5020:
--

I tend to agree with David that having a NULL key is dubious at best. Though
it's worth pointing out that SQL's semantics around NULL aren't consistent:
NULL = NULL does not evaluate to true (nor to false), but when grouping,
NULLs are collected together.

The real question from an HCat perspective is "What does Hive do?" If Hive
allows you to have null keys (with ORC), then HCat should too. The question
of whether Hive should allow that is a bigger topic for a different day.

[jira] [Commented] (HIVE-5020) HCat reading null-key map entries causes NPE

2013-08-11 Thread David Schorow (JIRA)

[ https://issues.apache.org/jira/browse/HIVE-5020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13736421#comment-13736421 ]

David Schorow commented on HIVE-5020:
-

It sounds like (b) is the simplest and safest solution, and it would
effectively make ORC work consistently with RC files.

In database semantics, NULL is a special value; for example, NULL does not
match NULL. Hence I don't know what it means to have a NULL key, or to try to
do a lookup with a NULL key. I think such a lookup should never return a
value, regardless of what is in the map. It may be okay to have NULL values
in a map.
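
These lookup semantics could be expressed as a thin wrapper that treats a
null lookup as always missing. A hypothetical helper, for illustration only:

{noformat}
import java.util.Map;

public final class SqlStyleLookup {

    // SQL-flavoured lookup: NULL never matches anything, so a lookup with a
    // null key returns nothing, even if the map happens to contain a null key.
    public static <K, V> V get(Map<K, V> map, K key) {
        if (key == null) {
            return null;  // NULL does not match NULL
        }
        return map.get(key);
    }
}
{noformat}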

[jira] [Commented] (HIVE-5020) HCat reading null-key map entries causes NPE

2013-08-07 Thread Edward Capriolo (JIRA)

[ https://issues.apache.org/jira/browse/HIVE-5020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13732719#comment-13732719 ]

Edward Capriolo commented on HIVE-5020:
---

If I had to hazard a guess, I would say that the original implementation was
about supporting Thrift structures. Possibly, since Thrift does not support
this case, that design was not carried over.

Personally, I think we SHOULD support NULL keys and NULL values in maps. The
map need not be sorted.
