Hive treats "\N" (note: 2 bytes: a backslash and an N) as NULL. An empty
string is still an empty string, not NULL.
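For example, using the customer table from the thread below (a minimal
sketch, assuming cust_id is loaded from a delimited text file):

  SELECT cust_id,
         cust_id IS NULL AS is_null,   -- true only where the file contains \N
         cust_id = ''    AS is_empty   -- true where the field is empty
  FROM customer;

Rows whose cust_id field held the two characters \N would come back with
is_null = true, while rows whose field was simply empty would come back with
is_empty = true.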


Zheng

On Thu, Apr 9, 2009 at 12:58 PM, Matt Pestritto <[email protected]> wrote:

> Zheng -
>
> That worked for me.  Thanks.  I thought I was taking care of that by
> checking for null.  Are null values used in Hive, or should I just always
> check for empty strings?
>
> Thanks.
>
>
> On Thu, Apr 9, 2009 at 1:52 PM, Zheng Shao <[email protected]> wrote:
>
>>
>> 00:04:36,004 WARN  [JoinOperator] table 0 has more than joinEmitInterval 
>> rows for join key []
>>
>> The above might be the reason. To compute the join result, we need to
>> cache in memory all rows from table 0 (customer in this case) that share a
>> given join key. It seems you have a lot of customers with an empty
>> cust_id. If you filter those out with a subquery, that should solve the
>> problem.
>>
>> select count(distinct c.email_address)
>> from (select * from customer where cust_id <> '') c
>> join clickstream hd on (c.cust_id = hd.customer_id)
>> where c.cust_id is not null
>>   and hd.customer_id is not null
>>   and c.email_address is not null;
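>>
>> To confirm that empty keys are the problem, a quick check along the same
>> lines (a sketch) would count the customer rows with an empty or missing
>> key:
>>
>> select count(1) from customer where cust_id = '' or cust_id is null;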
>>
>> On Thu, Apr 9, 2009 at 6:54 AM, Matt Pestritto <[email protected]> wrote:
>>
>>> Hi.
>>> I'm running into a problem that I can't seem to figure out.  I'm running
>>> a hive query and the last reduce always fails.  Number of Reducers - 1
>>> always complete successfully.  If I run with 1 reducer, that reducer just
>>> fails and restarts.  If I run with 2 reducers, 1 always completes
>>> successfully and the last one running always fails.  It actually doesn't
>>> fail, it time-outs then gets re-submitted, rinse and repeat until manually
>>> killed.  I was originally getting errors that I ran out of heap space, so I
>>> set mapred.child.java.opts=-Xmx1G and now the task times out.
>>> mapred.task.timeout=600000
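>>>
>>> For reference, both properties can be set per session from the Hive CLI
>>> before running the query, e.g.:
>>>
>>>   set mapred.child.java.opts=-Xmx1G;
>>>   set mapred.task.timeout=600000;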
>>>
>>> Here is the query that is run:
>>> select count(distinct c.email_address)
>>> from customer c
>>> join clickstream hd on (c.cust_id = hd.customer_id)
>>> where c.cust_id is not null
>>>   and hd.customer_id is not null
>>>   and c.email_address is not null;
>>>
>>> customer = 17M records.  clickstream = 320M records.
>>>
>>> 6-node cluster running Hadoop 17.2.1
>>>
>>> Something I did notice: the reduce that times out and re-submits is the
>>> only reduce with the following warning: [JoinOperator] table 0 has more
>>> than joinEmitInterval rows for join key []
>>> I tried to trace back into the code behind that warning, but couldn't
>>> understand what was going on.
>>>
>>> Any ideas, or ways to get additional visibility into what is happening?
>>>
>>> Thanks in advance.
>>> -Matt
>>>
>>>
>>> Logs:
>>>
>>> 00:04:30,380 INFO  [ReduceTask] task_200904061844_1501_r_000000_0 Copying 
>>> of all map outputs complete. Initiating the last merge on the remaining 
>>> files in ramfs://mapoutput249895724
>>> 00:04:33,929 INFO  [ReduceTask] task_200904061844_1501_r_000000_0 Merge of 
>>> the 39 files in InMemoryFileSystem complete. Local file is 
>>> /opt/hadoop-datastore/hadoop-hadoop/mapred/local/taskTracker/jobcache/job_200904061844_1501/task_200904061844_1501_r_000000_0/output/map_2348.out
>>>
>>> 00:04:35,754 INFO  [JoinOperator] Initializing Self
>>> 00:04:35,754 INFO  [JoinOperator] Initializing children:
>>> 00:04:35,754 INFO  [FilterOperator] Initializing Self
>>> 00:04:35,754 INFO  [FilterOperator] Initializing children:
>>> 00:04:35,754 INFO  [GroupByOperator] Initializing Self
>>> 00:04:35,754 INFO  [GroupByOperator] Initializing children:
>>> 00:04:35,754 INFO  [FileSinkOperator] Initializing Self
>>> 00:04:35,875 INFO  [FileSinkOperator] Writing to temp file: 
>>> /tmp/hive-hadoop/194018189/_tmp.16462794.10002/_tmp.1501_r_000000_0
>>> 00:04:35,889 INFO  [GroupByOperator] Initialization Done
>>> 00:04:35,931 INFO  [FilterOperator] Initialization Done
>>> 00:04:35,933 INFO  [JoinOperator] Initialization Done
>>> 00:04:35,933 DEBUG [ExecReducer] Start Group
>>> 00:04:35,956 DEBUG [ExecReducer] End Group
>>> 00:04:35,957 DEBUG [ExecReducer] Start Group
>>> 00:04:36,004 WARN  [JoinOperator] table 0 has more than joinEmitInterval 
>>> rows for join key []
>>>
>>> ------------------------------
>>>
>>>
>>> *stderr logs*
>>>
>>> Exception in thread "Thread-2" java.util.ConcurrentModificationException
>>>     at java.util.TreeMap$PrivateEntryIterator.nextEntry(TreeMap.java:1100)
>>>     at java.util.TreeMap$KeyIterator.next(TreeMap.java:1154)
>>>     at org.apache.hadoop.dfs.DFSClient.close(DFSClient.java:217)
>>>     at 
>>> org.apache.hadoop.dfs.DistributedFileSystem.close(DistributedFileSystem.java:214)
>>>     at org.apache.hadoop.fs.FileSystem$Cache.closeAll(FileSystem.java:1324)
>>>     at org.apache.hadoop.fs.FileSystem.closeAll(FileSystem.java:224)
>>>     at 
>>> org.apache.hadoop.fs.FileSystem$ClientFinalizer.run(FileSystem.java:209)
>>>
>>
>>
>> --
>> Yours,
>> Zheng
>>
>
>


-- 
Yours,
Zheng
