Re: LeaseExpiredException while test data creation on ppc64le

Nishidha Panpaliya Thu, 10 Mar 2016 09:39:12 -0800

Hi Tim,

I'm running a debug build.


Also, I could locate the crash point in the MurmurHash2_64 function for
specific test data. I could reproduce the segmentation fault in my
standalone test application that tests this hashing algorithm.
But even after handling the crash point in the Impala code, I'm still
getting a new error, shown below:

   Data Loading from Impala failed with error: ImpalaBeeswaxException:
    INNER EXCEPTION: <class 'socket.error'>
    MESSAGE: [Errno 104] Connection reset by peer

Still debugging the issue.

Thanks,
Nishidha




From:   Tim Armstrong <[email protected]>
To:     Nishidha Panpaliya/Austin/Contr/IBM@IBMUS
Cc:     [email protected], Sudarshan
            Jagadale/Austin/Contr/IBM@IBMUS, nishidha randad
            <[email protected]>
Date:   03/08/2016 10:55 PM
Subject:        Re: LeaseExpiredException while test data creation on ppc64le



Oh, are you running with a debug build or a release build by the way?

On Tue, Mar 8, 2016 at 8:58 AM, Tim Armstrong <[email protected]>
wrote:
  I don't think we've seen that crash before. It looks like dereferencing
  a pointer is causing the crash. Tracing back through the callstack, it
  looks like the expression below is somehow constructing a StringValue
  that causes a segfault when dereferenced (0x000000ffff9b6aa0).

  inline Status HdfsParquetTableWriter::BaseColumnWriter::AppendRow(
      TupleRow* row) {
    ++num_values_;
    void* value = expr_ctx_->GetValue(row); <==

  I'm not sure why it would be returning an invalid pointer. It's not a NULL
  pointer, and it looks possibly valid and 16-byte aligned. If you have a
  core dump, it would be interesting to know whether that pointer is
  pointing into invalid memory or if something else is going on.
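
  As a minimal sketch of the alignment part of that observation (the helper
  name IsAligned is hypothetical, not an Impala function), a power-of-two
  alignment test on the reported address could look like this. It only
  checks alignment; it says nothing about whether the address is mapped.

  ```cpp
  #include <cstdint>

  // Hypothetical helper: true if addr is a multiple of alignment.
  // Assumes alignment is a non-zero power of two.
  inline bool IsAligned(std::uint64_t addr, std::uint64_t alignment) {
    return (addr & (alignment - 1)) == 0;
  }
  ```

  For the address above, IsAligned(0x000000ffff9b6aa0ULL, 16) holds because
  the low four bits are zero.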

  > Thanks a lot Tim! I did check some of the impalad*.ERROR,
  > impalad*.INFO, and other logs in cluster_logs/data_loading. A couple of
  > observations:
  > 1. Dependency on SSE3: the error says it is exiting because the hardware
  > does not support SSE3.
  I think this is OK if you were able to compile. We have some inline
  assembly and SSE3 intrinsics, but you probably had to work around those
  already to build. You could fix this so that the check isn't done when
  running on PowerPC.
  > 2. One error says to increase num_of_threads_per_disk (something
  > related to the number of threads, not sure about the exact variable
  > name) while starting impalad.
  Hmm, this is probably because its detection of local disks fails. This is
  expected to happen if running on a remote filesystem (e.g. a cloud
  filesystem like S3, or some specialised disk hardware like Isilon or
  DSSD). If it's happening with local disks, it's probably because it
  assumes it's running on Linux with specific filesystem nodes for devices.
  > 3. A few log files mention bad_alloc.
  I think this is the exception that gets thrown when malloc() fails (when
  called via the C++ new operator). I wonder if the system is low on
  memory? How much RAM do you have?
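
  For point 1, a minimal sketch of skipping the SSE3 check on PowerPC,
  assuming the compiler's predefined architecture macros are used to select
  the behaviour. The function name ShouldCheckSse3 is hypothetical and does
  not come from the Impala source; the real check may be structured
  differently.

  ```cpp
  // Hypothetical helper: whether the startup SSE3 hardware check applies to
  // this build. GCC and Clang predefine __powerpc64__ / __powerpc__ when
  // targeting PowerPC, so the x86-only check is compiled out there.
  inline bool ShouldCheckSse3() {
  #if defined(__powerpc64__) || defined(__powerpc__)
    return false;
  #else
    return true;
  #endif
  }
  ```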

  I think how long it will take probably depends on what the end goal is:
  if it's just to get it running, probably a couple more weeks, maybe more
  if there are any particularly tricky bugs. If you want to do performance
  tuning, I feel like there's probably some more work there. I think we
  implicitly depend on certain properties of Intel hardware, e.g. recent
  Intel processors have reasonably fast unaligned loads and stores and in
  some cases we take advantage of that, but I'm not sure if that's also true
  of the processors you're targeting.
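
  To illustrate the unaligned-access point, here is a sketch of a 64-bit
  load that does not rely on the hardware tolerating misaligned pointers.
  LoadUnaligned64 is a hypothetical name, and whether any particular Impala
  code path needs this treatment is an assumption. Dereferencing a
  misaligned uint64_t* is undefined behaviour in C++, whereas memcpy is
  well defined for any address; compilers typically lower it to a plain
  load where the target allows that.

  ```cpp
  #include <cstdint>
  #include <cstring>

  // Hypothetical helper: reads 8 bytes starting at p, which may have any
  // alignment, and returns them as a host-endian 64-bit integer.
  inline std::uint64_t LoadUnaligned64(const void* p) {
    std::uint64_t v;
    std::memcpy(&v, p, sizeof(v));
    return v;
  }
  ```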

  On Tue, Mar 8, 2016 at 8:26 AM, Nishidha Panpaliya <[email protected]>
  wrote:
   Hi Tim,

   As you suggested, I disabled codegen and also generated a core dump.

   The core dump points to HashUtil::MurmurHash2_64 as the problem. Please
   see the attached log file.
   (See attached file: hs_err_pid15697.log)

   I tested this function individually in a small test app and it worked.
   Maybe the data given to it was simple enough for it to pass. But in
   Impala, there is some issue with the data/arguments passed to this
   function in a particular case. It looks like this function is not called
   on machines where SSE is supported, so on x86 you might not see this
   crash. Do you suspect anything in this function or in the functions
   calling it? I'm still debugging this.
   If you have any clue, please point me to it so that I can try to nail
   down the issue in that direction.
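
   One way to make such a standalone test less "simple" is to hash the same
   bytes at every byte offset and for every length, so that misaligned
   starts and 1..7 byte tails are all exercised. The sketch below is only
   an illustration: the HashFn signature is an assumption about the
   function under test, and Fnv1a64 is a stand-in hash included so the
   example is self-contained. It is not Impala's MurmurHash2_64.

   ```cpp
   #include <cstddef>
   #include <cstdint>
   #include <cstring>
   #include <vector>

   // Signature assumed for illustration; adapt it to the function under test.
   using HashFn = std::uint64_t (*)(const void* data, int len,
                                    std::uint64_t seed);

   // Stand-in hash (64-bit FNV-1a, read byte by byte) so the sketch is
   // self-contained. It is NOT Impala's MurmurHash2_64.
   inline std::uint64_t Fnv1a64(const void* data, int len, std::uint64_t seed) {
     const unsigned char* p = static_cast<const unsigned char*>(data);
     std::uint64_t h = 0xcbf29ce484222325ULL ^ seed;
     for (int i = 0; i < len; ++i) {
       h ^= p[i];
       h *= 0x100000001b3ULL;
     }
     return h;
   }

   // Hashes every non-empty prefix of input at byte offsets 0..7 inside a
   // scratch buffer. Returns true if each prefix hashes to the same value at
   // every offset. On hardware that faults on misaligned access, a hash that
   // does word-sized loads may crash here instead of returning.
   inline bool HashIsPlacementIndependent(HashFn hash,
                                          const std::vector<unsigned char>& input,
                                          std::uint64_t seed) {
     std::vector<unsigned char> buf(input.size() + 8);
     for (std::size_t len = 1; len <= input.size(); ++len) {
       std::memcpy(buf.data(), input.data(), len);
       const std::uint64_t reference =
           hash(buf.data(), static_cast<int>(len), seed);
       for (std::size_t offset = 1; offset < 8; ++offset) {
         std::memcpy(buf.data() + offset, input.data(), len);
         if (hash(buf.data() + offset, static_cast<int>(len), seed) !=
             reference) {
           return false;
         }
       }
     }
     return true;
   }
   ```

   Passing the real hash function in place of Fnv1a64 would show whether a
   failure depends on where the input sits in memory or on its length.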

   Thanks,
   Nishidha

   From: nishidha randad <[email protected]>
   To: Tim Armstrong <[email protected]>
   Cc: Nishidha Panpaliya/Austin/Contr/IBM@IBMUS, Sudarshan
   Jagadale/Austin/Contr/IBM@IBMUS, [email protected]
   Date: 03/07/2016 10:35 PM
   Subject: Re: LeaseExpiredException while test data creation on ppc64le



   Thanks a lot Tim! I did check some of the impalad*.ERROR,
   impalad*.INFO, and other logs in cluster_logs/data_loading. A couple of
   observations:
   1. Dependency on SSE3: the error says it is exiting because the hardware
   does not support SSE3.
   2. One error says to increase num_of_threads_per_disk (something related
   to the number of threads, not sure about the exact variable name) while
   starting impalad.
   3. A few log files mention bad_alloc.


   I'm analysing all these errors. I'll dig more into this tomorrow and
   update you.
   One more thing I'd like your help with is estimating the amount of work
   I may have left and the possible challenges ahead. It would be really
   great if you could point those out based on the logs I posted.

   Also, about the LLVM 3.7 fixes you did, I was wondering whether you have
   completed the upgrade, since you have also started encountering crashes.

   Thanks again!

   Nishidha


   On 7 Mar 2016 21:56, "Tim Armstrong" <[email protected]> wrote:


         Hi Nishidha,
           I started working on our next release cycle towards the end of
         last week, so I've been looking at LLVM 3.7 and have made a bit of
         progress getting it working on Intel. We're trying to get it
         working so we have plenty of chance to test it.

         RE the TTransportException error, that is often because of a
         crash. Usually to debug I would first look at
         the /tmp/impalad.ERROR and /tmp/impalad.INFO logs for the cause of
         the crash. The embedded JVM also generates hs_err_pid*.log files
         with a crash report that can sometimes be useful. If that doesn't
         reveal the cause, then I'd look to see if there is a core dump in
         the Impala directory (I normally run with "ulimit -c unlimited"
         set so that a crash will generate a core file).

         I already fixed a couple of problems with codegen in LLVM 3.7,
         including one crash that was an assertion about struct sizes. I'll
         be posting the patch soon once I've done a bit more testing.

         It might help you make progress if you disable LLVM codegen by
         default during data loading by setting the following environment
         variable:

         export START_CLUSTER_ARGS='--impalad_args=-default_query_options="disable_codegen=1"'


         You can also start the test cluster with the same arguments, or
         just set it in the shell with "set disable_codegen=1":

         ./bin/start-impala-cluster.py --impalad_args=-default_query_options="disable_codegen=1"

         On Mon, Mar 7, 2016 at 5:13 AM, Nishidha Panpaliya <
         [email protected]> wrote:
         Hi Tim,

         Yes, I could fix this SnappyError by building snappy-java for
         Power and adding the native library for Power to the existing
         snappy-java-1.0.4.1.jar used by HBase, Hive, Sentry and Hadoop.
         Test data loading then proceeded further and hit a new exception,
         shown below, which I'm looking into.

         Data Loading from Impala failed with error:
         ImpalaBeeswaxException:
         INNER EXCEPTION: <class
         'thrift.transport.TTransport.TTransportException'>
         MESSAGE: None

         Also, I've been able to start Impala and run just the following
         query, as given in
         https://github.com/cloudera/Impala/wiki/How-to-build-Impala-
         impala-shell.sh -q"SELECT version()"

         Regarding the patch of my work, I'm sorry for the delay. Although
         it does not need a CLA to be signed, it is under discussion with
         our IBM legal team, just to ensure we are compliant with the
         policies. I hope to update you on this soon. Could you tell me
         when you are going to start this new release cycle?

         Thanks,
         Nishidha

         From: Tim Armstrong <[email protected]>
         To: nishidha panpaliya <[email protected]>
         Cc: Impala Dev <[email protected]>, Nishidha
         Panpaliya/Austin/Contr/IBM@IBMUS, Sudarshan
         Jagadale/Austin/Contr/IBM@IBMUS
         Date: 03/05/2016 03:14 AM
         Subject: Re: LeaseExpiredException while test data creation on
         ppc64le




         It also looks like it got far enough that you should have a bit of
         data loaded - have you been able to start impala and run queries
         on some of those tables?

         We're starting a new release cycle so I'm actually about to focus
         on upgrading our version of LLVM to 3.7 and getting the Intel
         support working. I think we're going to be putting a bit of effort
         into reducing LLVM code generation time: it seems like LLVM 3.7 is
         slightly slower in some cases.

         We should stay in sync, it would be good to make sure that any
         changes I make will work for your PowerPC work too. If you want to
         share any patches (even if you're not formally contributing them)
         it would be helpful for me to understand what you have already
         done on this path.

         Cheers,
         Tim

         On Fri, Mar 4, 2016 at 1:40 PM, Tim Armstrong <
         [email protected]> wrote:

                     Hi Nishidha,
                       It looks like Hive is maybe missing the native
                     snappy library: I see this in the logs:

                     java.lang.Exception: org.xerial.snappy.SnappyError: [FAILED_TO_LOAD_NATIVE_LIBRARY] null
                         at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
                         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
                     Caused by: org.xerial.snappy.SnappyError: [FAILED_TO_LOAD_NATIVE_LIBRARY] null
                         at org.xerial.snappy.SnappyLoader.load(SnappyLoader.java:229)
                         at org.xerial.snappy.Snappy.<clinit>(Snappy.java:44)
                         at org.apache.avro.file.SnappyCodec.compress(SnappyCodec.java:43)
                         at org.apache.avro.file.DataFileStream$DataBlock.compressUsing(DataFileStream.java:361)
                         at org.apache.avro.file.DataFileWriter.writeBlock(DataFileWriter.java:394)
                         at org.apache.avro.file.DataFileWriter.sync(DataFileWriter.java:413)



                     If you want to try making progress without Hive snappy
                     support, I think you could disable some of the file
                     formats by editing testdata/workloads/*/*.csv and
                     removing some of the "snap" file formats. The Impala
                     test suite generates data in many different file
                     formats with different compression settings.


                     On Wed, Mar 2, 2016 at 7:08 AM, nishidha panpaliya <
                     [email protected]> wrote:
                     Hello,

                     After building Impala on ppc64le, I'm trying to run
                     all of Impala's tests. In the process, I'm getting
                     an error during test data creation.
                     Command run:
                         ${IMPALA_HOME}/buildall.sh -testdata -format
                     Output: attached log (output.txt)

                     Also attached are the logs
                     cluster_logs/data_loading/data-load-functional-exhaustive.log
                     and hive.log.

                     I tried setting the parameters below in hive-site.xml,
                     but to no avail:
                         hive.exec.max.dynamic.partitions=100000;
                         hive.exec.max.dynamic.partitions.pernode=100000;
                         hive.exec.parallel=false

                     I'll be really thankful if you could provide me some
                     help here.

                     Thanks in advance,
                     Nishidha

