Hive treats "\\N" (note: 2 bytes: \ and N) as NULL values. Empty string is still empty string, not NULL.
Zheng

On Thu, Apr 9, 2009 at 12:58 PM, Matt Pestritto <[email protected]> wrote:

> Zheng -
>
> That worked for me. Thanks. I thought I was taking care of that by checking for null. Are null values used in Hive, or should I just always check for the empty string?
>
> Thanks.
>
> On Thu, Apr 9, 2009 at 1:52 PM, Zheng Shao <[email protected]> wrote:
>
>> 00:04:36,004 WARN [JoinOperator] table 0 has more than joinEmitInterval rows for join key []
>>
>> The above might be the reason. In order to calculate the join result, we need to cache all rows with a specific key from table 0 (customer in this case) in memory. It seems that you have a lot of customers with an empty cust_id. If you do a subquery to filter them out, that should solve the problem:
>>
>> select count(distinct c.email_address) from (select * from customer where cust_id <> '') c join clickstream hd on (c.cust_id = hd.customer_id) where c.cust_id is not null and hd.customer_id is not null and c.email_address is not null;
>>
>> On Thu, Apr 9, 2009 at 6:54 AM, Matt Pestritto <[email protected]> wrote:
>>
>>> Hi.
>>> I'm running into a problem that I can't seem to figure out. I'm running a Hive query and the last reduce always fails. All but one of the reducers always complete successfully. If I run with 1 reducer, that reducer just fails and restarts. If I run with 2 reducers, 1 always completes successfully and the last one running always fails. It doesn't actually fail; it times out, then gets re-submitted, rinse and repeat until it is manually killed. I was originally getting errors that I ran out of heap space, so I set mapred.child.java.opts=-Xmx1G and now the task times out. mapred.task.timeout=600000
>>>
>>> Here is the query that is run:
>>> select count(distinct c.email_address) from customer c join clickstream hd on (c.cust_id = hd.customer_id) where c.cust_id is not null and hd.customer_id is not null and c.email_address is not null;
>>> customer = 17M records. clickstream = 320M records.
>>>
>>> 6-node cluster running Hadoop 17.2.1.
>>>
>>> Something I did notice: the reduce that times out and gets re-submitted is the only reduce with the following warning: [JoinOperator] table 0 has more than joinEmitInterval rows for join key []
>>> I tried to trace back through the code for that warning, but couldn't understand what was going on.
>>>
>>> Any ideas, or ways to get additional visibility into what is happening?
>>>
>>> Thanks in advance.
>>> -Matt
>>>
>>> Logs:
>>>
>>> 00:04:30,380 INFO [ReduceTask] task_200904061844_1501_r_000000_0 Copying of all map outputs complete. Initiating the last merge on the remaining files in ramfs://mapoutput249895724
>>> 00:04:33,929 INFO [ReduceTask] task_200904061844_1501_r_000000_0 Merge of the 39 files in InMemoryFileSystem complete. Local file is /opt/hadoop-datastore/hadoop-hadoop/mapred/local/taskTracker/jobcache/job_200904061844_1501/task_200904061844_1501_r_000000_0/output/map_2348.out
>>>
>>> 00:04:35,754 INFO [JoinOperator] Initializing Self
>>> 00:04:35,754 INFO [JoinOperator] Initializing children:
>>> 00:04:35,754 INFO [FilterOperator] Initializing Self
>>> 00:04:35,754 INFO [FilterOperator] Initializing children:
>>> 00:04:35,754 INFO [GroupByOperator] Initializing Self
>>> 00:04:35,754 INFO [GroupByOperator] Initializing children:
>>> 00:04:35,754 INFO [FileSinkOperator] Initializing Self
>>> 00:04:35,875 INFO [FileSinkOperator] Writing to temp file: /tmp/hive-hadoop/194018189/_tmp.16462794.10002/_tmp.1501_r_000000_0
>>> 00:04:35,889 INFO [GroupByOperator] Initialization Done
>>> 00:04:35,931 INFO [FilterOperator] Initialization Done
>>> 00:04:35,933 INFO [JoinOperator] Initialization Done
>>> 00:04:35,933 DEBUG [ExecReducer] Start Group
>>> 00:04:35,956 DEBUG [ExecReducer] End Group
>>> 00:04:35,957 DEBUG [ExecReducer] Start Group
>>> 00:04:36,004 WARN [JoinOperator] table 0 has more than joinEmitInterval rows for join key []
>>>
>>> ------------------------------
>>>
>>> stderr logs:
>>>
>>> Exception in thread "Thread-2" java.util.ConcurrentModificationException
>>>   at java.util.TreeMap$PrivateEntryIterator.nextEntry(TreeMap.java:1100)
>>>   at java.util.TreeMap$KeyIterator.next(TreeMap.java:1154)
>>>   at org.apache.hadoop.dfs.DFSClient.close(DFSClient.java:217)
>>>   at org.apache.hadoop.dfs.DistributedFileSystem.close(DistributedFileSystem.java:214)
>>>   at org.apache.hadoop.fs.FileSystem$Cache.closeAll(FileSystem.java:1324)
>>>   at org.apache.hadoop.fs.FileSystem.closeAll(FileSystem.java:224)
>>>   at org.apache.hadoop.fs.FileSystem$ClientFinalizer.run(FileSystem.java:209)
>>
>> --
>> Yours,
>> Zheng
>

--
Yours,
Zheng
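A quick way to confirm the empty-key skew described above before adding the filter is a pair of counts against the join key (a sketch reusing the customer table and cust_id column from the query above):

select count(1) from customer where cust_id = '';     -- rows that all share the empty join key
select count(1) from customer where cust_id is null;  -- rows whose key was stored as \N

All of the rows counted by the first query land on a single reduce-side join key, which is what triggers the joinEmitInterval warning and the memory pressure; filtering them out in a subquery, as in the suggested query, avoids buffering them.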
