Congrats on the recognition of your work, Abhishek!
Todd
On Wed, Feb 22, 2023, 6:15 PM 邓科 wrote:
> Congrats Abhishek!!!
>
> On Thu, Feb 23, 2023 at 08:05, Yingchun Lai wrote:
>
> > Congrats!
> >
> > On Thu, Feb 23, 2023 at 03:07, Mahesh Reddy wrote:
> >
> > > Congrats Abhishek!!! Great work and well deserved!
> > >
> > > On
Hi Mauricio,
Sorry for the late reply on this one. Hope "better late than never" is the
case here :)
As you implied in your email, the main issue with increasing queue length
to deal with queue overflows is that it only helps with momentary spikes.
According to queueing theory (and intuition), if
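That intuition can be sketched with a toy overload model (illustrative only, not from the original email): once the arrival rate exceeds the service rate, a longer queue only delays the overflow.

```python
def backlog_after(seconds, arrival_rate, service_rate):
    # In sustained overload the backlog grows linearly with time,
    # regardless of how long the queue is; a bigger queue only buys
    # time during momentary spikes.
    return max(0, (arrival_rate - service_rate) * seconds)

assert backlog_after(10, 120, 100) == 200  # overloaded: backlog keeps growing
assert backlog_after(10, 80, 100) == 0     # underloaded: no steady backlog
```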
r 29, 2020 at 11:44 AM pino patera
> wrote:
> >>
> >> Hi
> >> anyone integrated Kudu into Dremio (/www.dremio.com) data lake?
> >> Any alternative suggestion (i.e. Presto?) ?
> >>
> >> Thanks
>
--
Todd Lipcon
Software Engineer, Cloudera
n? Am asking
> because I see some correlation between update schema operation (deleting
> range partition) time and the number of transactions in flight on Kudu
> tablet servers.
>
> Regards Dmitry
>
>
>
to dockerize
> kudu.
> My concern about dockerizing Kudu is storage performance loss. I
> found this Docker storage driver benchmark (last updated October 2017):
> https://github.com/chriskuehl/docker-storage-benchmark
>
> Best regards,
> Kyle Zhike Chen
>
they can do real-time but I
> just watched a demo, and it looks like a classical batch/incremental
> process.
> https://community.incorta.com/t/18d8x2/data-hubmaterialized-view-question
>
>
> https://community.incorta.com/t/18jndy/what-are-the-types-of-data-load-that-incorta-supports
>
>
>
Would be useful to capture top -H during the workload as well to see if any
particular threads are at 100%. Could be the reactor thread acting as a
bottleneck.
On Fri, Jul 12, 2019, 10:54 AM Adar Lieber-Dembo wrote:
> Thanks for the detailed summary and analysis. I want to make sure I
> understan
34 kudu::Thread::SuperviseThread()
>
> @ 0x7f049719cdd5 start_thread
>
> @ 0x7f0495473ead __clone
>
This is just a warning about a potential latency blip, and likely
completely unrelated to the problem you're reporting.
-Todd
696..1176034391 2 (3920336..3923031) 2696
> 0
> 13: [36592..39191]: 1176037440..1176040039 2 (3926080..3928679) 2600
> 0
> 14: [39192..41839]: 1176072008..1176074655 2 (3960648..3963295) 2648
> 0
> 15: [41840..44423]: 1176097752..1176100335 2 (
ly, I
> was unable to find any information, only a few JIRA tasks, but that didn't
> help.
>
> https://issues.apache.org/jira/browse/KUDU-2853
> https://issues.apache.org/jira/browse/KUDU-1644
>
> Best regards, Sergey.
>
y 8 blocks and all other blocks are the
> hole.
>
>
> So looks like I can use formulas with confidence.
> Normal case: 8 MB/segment * 80 max segments * 2000 tablets = 1,280,000 MB
> = ~1.3 TB (+ some minor index overhead)
> Worst case: 8 MB/segment * 1 segment * 2000 tablets = 1,280,
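As a sanity check, the normal-case arithmetic quoted above works out (a toy verification only; the segment size and retention count are the numbers assumed in the email, not authoritative defaults):

```python
# Rough WAL sizing check for the normal case quoted above.
segment_mb = 8        # WAL segment size assumed in the email
max_segments = 80     # max retained segments assumed in the email
tablets = 2000

normal_mb = segment_mb * max_segments * tablets
assert normal_mb == 1_280_000  # ~1.3 TB, matching the estimate above
```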
on?
>
> 3. Not a question. Please, consider adding documentation about the
> estimation of WAL storage. Also, I can't find any mention of index
> files, except here:
> https://kudu.apache.org/docs/scaling_guide.html#file_descriptors.
>
> Thanks!
>
> --
> with best regards, Pavel Martynov
>
> As you can see, the kudu tool encodes zeros as \u, but doesn't encode some
> other non-text bytes.
>
> What do you think about it?
>
> --
> with best regards, Pavel Martynov
>
Hi Kudu community,
I'm happy to announce that the Kudu PMC has voted to add Yingchun Lai as a
new committer and PMC member.
Yingchun has been contributing to Kudu for the last 6-7 months and
contributed a number of bug fixes, improvements, and features, including:
- new CLI tools (eg 'kudu table
(code THRIFTTRANSPORT):
> TTransportException('TSocket read 0 bytes',).
> Could you please tell me how to deal with this problem? By the way, Kudu
> is installed from an RPM; the related URL:
> https://github.com/MartinWeindel/kudu-rpm.
>
>
> Best wishes.
> yours truly,
> Jack Lin
>
used with every single query and to make things worse joined
> more than once in the same query.
>
> Is there a way to replicate this table on every node to improve
> performance and avoid broadcasting this table every time?
>
> On Mon, Jul 23, 2018 at 10:52 AM Todd Lipcon wrote
>> Mike
>>>>
>>>> Sent from my iPhone
>>>>
>>>> > On Jan 16, 2019, at 12:27 PM, Boris Tyukin
>>>> wrote:
>>>> >
>>>> > Hi guys,
>>>> >
>>>> > is there a setting on Kudu se
\
> libsasl2-dev \
> libsasl2-modules \
> libsasl2-modules-gssapi-mit \
> libssl-dev \
> libtool \
> lsb-release \
> make \
> ntp \
> net-tools \
> openjdk-8-jdk \
> openssl \
> patch \
> python-dev \
> python-pip \
> python3-dev \
> python3 \
> python3-pip \
> pkg-config \
> python \
> rsync \
> unzip \
> vim-common \
> wget
>
> #Install Kudu
> #RUN git clone https://github.com/apache/kudu \
> WORKDIR /
> RUN wget
> https://www-us.apache.org/dist/kudu/1.8.0/apache-kudu-1.8.0.tar.gz
> RUN mkdir -p /kudu && tar -xzf apache-kudu-1.8.0.tar.gz -C /kudu
> --strip-components=1
> RUN ls /
>
> RUN cd /kudu \
> && thirdparty/build-if-necessary.sh
> RUN cd /kudu && mkdir -p build/release \
> && cd /kudu/build/release \
> && ../../thirdparty/installed/common/bin/cmake -DCMAKE_BUILD_TYPE=release
> -DCMAKE_INSTALL_PREFIX:PATH=/usr ../.. \
> && make -j4
>
> RUN cd /kudu/build/release \
> && make install
>
Kind regards,
>
> Alexey
>
> On Fri, Nov 16, 2018 at 7:24 PM Boris Tyukin
> wrote:
>
>> Hi Todd,
>>
>> We are on Kudu 1.5 still and I used Kudu client 1.7
>>
>> Thanks,
>> Boris
>>
>> On Fri, Nov 16, 2018, 17:07 Todd Lipcon >
>&g
NT32 key=9, STRING value=value 9, UNIXTIME_MICROS
> dt_tm=2018-11-16T20:57:03.603000Z
> INT32 key=3, STRING value=value 3, UNIXTIME_MICROS
> dt_tm=2018-11-16T20:57:03.595000Z
> INT32 key=10, STRING value=NULL, UNIXTIME_MICROS
> dt_tm=2018-11-16T20:57:03.603000Z
> INT32 key=5, STRING value=value 5, UNIXTIME_MICROS
> dt_tm=2018-11-16T20:57:03.597000Z
> INT32 key=7, STRING value=value 7, UNIXTIME_MICROS
> dt_tm=2018-11-16T20:57:03.598000Z
>
>
>
> (env) root@boot2docker:~/kudu# python
> Python 2.7.12 (default, Dec 4 2017, 14:50:18)
> [GCC 5.4.0 20160609] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import kudu
> >>>
ity,
>
> Does anybody know the maximum number of distinct values of a STRING column
> that Kudu considers in order to set its encoding to Dictionary? Many thanks
> :)
>
> br,
>
>
to share with other systems.
One recommendation, though, is to consider using a dedicated disk for the
Kudu WAL and metadata, which can help performance, since the WAL can be
sensitive to other heavy workloads monopolizing bandwidth on the same
spindle.
-Todd
>
> At 2018-08-03 02:26:37, "Tod
with 15 * 4TB spinning disk drives and 256GB
> RAM, 48 cpu cores. Does it mean the other 52(= 15 * 4 - 8) TB space is
> recommended to leave for other systems? We prefer to make the machine
> dedicated to Kudu. Can tablet server leverage the whole space efficiently?
> >
> > Thanks,
> > Quanlong
>
defaults or giving some more prescriptive
advice?
I'm a little nervous that saying "here are all the internals, and here are
100 config flags to study" will scare users more than help them :)
-Todd
>
> At 2018-08-02 01:06:40,"Todd Lipcon" wrote:
>
> On Wed, A
make this trade-off.
-Todd
> At 2018-06-15 23:41:17, "Todd Lipcon" wrote:
>
> Also, keep in mind that when the MRS flushes, it flushes into a bunch of
> separate RowSets, not 1:1. It "rolls" to a new RowSet every N MB (N=32 by
> default). This is set by --budge
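A toy sketch of that rolling arithmetic (assuming the 32 MB default roll size mentioned above; the function name is illustrative, not Kudu code):

```python
import math

def rowsets_per_flush(mrs_size_mb, roll_mb=32):
    # An MRS flush "rolls" to a new RowSet every roll_mb megabytes,
    # so one flush lands in ceil(mrs_size_mb / roll_mb) separate RowSets
    # rather than a single 1:1 output.
    return math.ceil(mrs_size_mb / roll_mb)

print(rowsets_per_flush(1024))  # a 1 GiB flush yields 32 RowSets
```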
hing like an extremely high replication count.*
>
> *I could see bumping the replication count to 5 for these tables since the
> extra storage cost is low and it will ensure higher availability of the
> important central tables, but I'd be surprised if there is any measurable
> p
Impala 2.12. The external RPC protocol is still Thrift.
Todd
On Mon, Jul 23, 2018, 7:02 AM Clifford Resnick
wrote:
> Is this impala 3.0? I’m concerned about breaking changes and our RPC to
> Impala is thrift-based.
>
> From: Todd Lipcon
> Reply-To: "user@kudu.apache.org&qu
thread's context
>>>>> perhaps partition kudu table, even if small, into multiple tablets), it
>>>>> was
>>>>> to speed up joins/exchanges, not to parallelize the scan.
>>>>>
>>>>> For example recently we ran into
s or avoid it .
>
> Thanks !
>
> Best regards .
>
>
> --
>
> wang jiaxi
>
e should happen automatically so long as the filter predicate has
been pushed down. Using 'explain()' and showing us the results, along with
the code you used to create your table, will help understand what might be
the problem with performance.
-Todd
te should be executed:
>
> *UPDATE hive_meta_store_database.TABLE_PARAMS*
>
> *SET PARAM_VALUE = 'master-1,master-2,master-3'*
>
> *WHERE PARAM_KEY = 'kudu.master_addresses' AND PARAM_VALUE =
> 'master-1,master-2,master-3,master-4';*
>
> After upgrades, the master-4 node to be removed by running
> steps 1-5.
>
>
>
> Thanks!
>
>
>
> Best regards,
>
> *Sergejs Andrejevs*
>
ssion contains Insert, Update, Delete operations; if
> the database does not exist in the data, there will be
> some new data loss. How can such problems be avoided?
>
_time_us":1288971,"lbm_reads_1-10_ms":32,"lbm_reads_10-100_ms":41,
>> "lbm_reads_lt_1ms":4641,"lbm_write_time_us":122520,"lbm_writes_lt_1ms":2799,
>> "mutex_wait_us":25,"spinlock_wait_cycles":155264,"tcmalloc_contention_cycles":768,
>> "thread_start_us":677,"threads_started":14,"wal-append.queue_time_us":300}
>>
>> The flush_threshold_mb is set in the default value (1024). Wouldn't the
>> flushed file size be ~1GB?
>>
>> I think increasing the initial RowSet size can reduce compactions and
>> then reduce the impact of other ongoing operations. It may also improve the
>> flush performance. Is that right? If so, how can I increase the RowSet size?
>>
>> I'd be grateful if someone can make me clear about these!
>>
>> Thanks,
>> Quanlong
>>
>
>
On Mon, May 21, 2018 at 4:37 PM, Quanlong Huang
wrote:
> Hi friends,
>
> We're trying to benchmark Impala+kudu to compare with other lambda
> architectures like Druid. So we hope we can install the latest release
> version of Impala (2.12.0) and kudu (1.7.0). However, when following the
> instal
ct even though these data were loaded first.
> I do not know Kudu's compaction mechanism; will it lead to many
> compactions, and thus to bad scan performance?
>
> Best regards.
>
quot;78" <= VALUES < "785000",
> PARTITION "785000" <= VALUES < "79",
> PARTITION "79" <= VALUES < "795000",
> PARTITION "795000" <= VALUES < "80",
> PARTITION "80" <= VALUES < "805000",
> PARTITION "805000" <= VALUES < "81",
> PARTITION "81" <= VALUES < "815000",
> PARTITION "815000" <= VALUES < "82",
> PARTITION "82" <= VALUES < "825000",
> PARTITION "825000" <= VALUES < "83",
> PARTITION "83" <= VALUES < "835000",
> PARTITION "835000" <= VALUES < "84",
> PARTITION "84" <= VALUES < "845000",
> PARTITION "845000" <= VALUES < "85",
> PARTITION "85" <= VALUES < "855000",
> PARTITION "855000" <= VALUES < "86",
> PARTITION "86" <= VALUES < "865000",
> PARTITION "865000" <= VALUES < "87",
> PARTITION "87" <= VALUES < "875000",
> PARTITION "875000" <= VALUES < "88",
> PARTITION "88" <= VALUES < "885000",
> PARTITION "885000" <= VALUES < "89",
> PARTITION "89" <= VALUES < "895000",
> PARTITION "895000" <= VALUES < "90",
> PARTITION "90" <= VALUES < "905000",
> PARTITION "905000" <= VALUES < "91",
> PARTITION "91" <= VALUES < "915000",
> PARTITION "915000" <= VALUES < "92",
> PARTITION "92" <= VALUES < "925000",
> PARTITION "925000" <= VALUES < "93",
> PARTITION "93" <= VALUES < "935000",
> PARTITION "935000" <= VALUES < "94",
> PARTITION "94" <= VALUES < "945000",
> PARTITION "945000" <= VALUES < "95",
> PARTITION "95" <= VALUES < "955000",
> PARTITION "955000" <= VALUES < "96",
> PARTITION "96" <= VALUES < "965000",
> PARTITION "965000" <= VALUES < "97",
> PARTITION "97" <= VALUES < "975000",
> PARTITION "975000" <= VALUES < "98",
> PARTITION "98" <= VALUES < "985000",
> PARTITION "985000" <= VALUES < "99",
> PARTITION "99" <= VALUES < "995000",
> PARTITION VALUES >= "995000"
> )
>
>
>
So it looks like you have a numeric value being stored here in the string
column. Are you sure that you are properly zero-padding when creating your
key? For example if you accidentally scan from "50_..." to "80_..." you
will end up scanning a huge portion of your table.
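A quick illustration of the padding issue (hypothetical keys, not from this thread): lexicographic order of unpadded numbers diverges from numeric order, which can silently widen a string-keyed range scan.

```python
# Unpadded numeric strings sort lexicographically, not numerically:
unpadded = sorted(["9", "10", "100"])
assert unpadded == ["10", "100", "9"]      # "9" sorts last -- not numeric order

# Zero-padding to a fixed width makes the two orders agree:
padded = sorted(n.zfill(4) for n in ["9", "10", "100"])
assert padded == ["0009", "0010", "0100"]  # matches numeric order
```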
> I have never deleted rows in this table.
>
> My scanner code is below.
> The buildKey method builds the lower bound and the upper bound; the unique
> id is the same, the startRow offset (third part) is 0, and the endRow offset is
> ; startRow and endRow differ only in the time part.
> Though the max offset is big (999), generally it is less than 100.
>
> private KuduScanner buildScanner(Metric startRow, Metric endRow,
> List dimensionIds, List dimensionFilterList) {
> KuduTable kuduTable =
> kuduService.getKuduTable(BizConfig.parseFrom(startRow.getBizId()));
>
> PartialRow lower = kuduTable.getSchema().newPartialRow();
> lower.addString("key", buildKey(startRow));
> PartialRow upper = kuduTable.getSchema().newPartialRow();
> upper.addString("key", buildKey(endRow));
>
> LOG.info("build scanner. lower = {}, upper = {}", buildKey(startRow),
> buildKey(endRow));
>
> KuduScanner.KuduScannerBuilder builder =
> kuduService.getKuduClient().newScannerBuilder(kuduTable);
> builder.setProjectedColumnNames(COLUMNS);
> builder.lowerBound(lower);
> builder.exclusiveUpperBound(upper);
> builder.prefetching(true);
> builder.batchSizeBytes(MAX_BATCH_SIZE);
>
> if (CollectionUtils.isNotEmpty(dimensionFilterList)) {
> for (int i = 0; i < dimensionIds.size() && i < MAX_DIMENSION_NUM;
> i++) {
> for (DimensionFilter dimensionFilter : dimensionFilterList) {
> if (!Objects.equals(dimensionFilter.getDimensionId(),
> dimensionIds.get(i))) {
> continue;
> }
> ColumnSchema columnSchema =
> kuduTable.getSchema().getColumn(String.format("dimension_%02d", i));
> KuduPredicate predicate = buildKuduPredicate(columnSchema,
> dimensionFilter);
> if (predicate != null) {
> builder.addPredicate(predicate);
> LOG.info("add predicate. predicate = {}",
> predicate.toString());
> }
> }
> }
> }
> return builder.build();
> }
>
>
What client version are you using? 1.7.0?
> I checked the metrics and only got the content below; it seems to have no
> relationship with my table.
>
Looks like you got the metrics from the kudu master, not a tablet server.
You need to figure out which tablet server you are scanning and grab the
metrics from that one.
-Todd
{remote=136.243.74.42:7050
>> (slave5), user_credentials={real_user=root}} blocked reactor thread for
>> 35859.8us
>>
>> I0507 09:38:15.942150 29882 outbound_call.cc:288] RPC callback for RPC
>> call kudu.tserver.TabletServerService.Write -> {remote=136.243.74.42:7050
>> (slave5), user_credentials={real_user=root}} blocked reactor thread for
>> 40664.9us
>>
>> I0507 09:38:17.495046 29882 outbound_call.cc:288] RPC callback for RPC
>> call kudu.tserver.TabletServerService.Write -> {remote=136.243.74.42:7050
>> (slave5), user_credentials={real_user=root}} blocked reactor thread for
>> 49514.6us
>>
>> I0507 09:46:12.664149 4507 coordinator.cc:783] Release admission control
>> resources for query_id=3e4a4c646800e1d9:c859bb7f
>>
>> F0507 09:46:12.673912 29258 error-util.cc:148] Check failed:
>> log_entry.count > 0 (-1831809966 vs. 0)
>>
>> Wrote minidump to /tmp/minidumps/impalad/a9113d9
>> b-bc3d-488a-1feebf9b-47b42022.dmp
>>
>>
>>
>> *Note*:
>>
>> We are executing the queries on 8 node cluster with the following
>> configuration
>>
>> Cluster : 8 Node Cluster (48 GB RAM , 8 CPU Core and 2 TB hard-disk each,
>> Intel(R) Core(TM) i7 CPU 950 @ 3.07GHz
>>
>>
>>
>>
>>
>> --
>>
>> Regards,
>>
>> Geetika Gupta
>>
>
>
>
> --
> Regards,
> Geetika Gupta
>
a between the bounds is about 8000, so I have to call nextRows()
> hundreds of times to fetch all the data, and it finally costs several minutes.
>
> I don't know why this happened or how to resolve it. Maybe the final
> solution is that I should give up Kudu and use HBase instead...
>
if you had for example:
pre-chunk in-list: 1,2,3,4,5,6
chunk 1: col2 IN (1,6)
chunk 2: col2 IN (2,5)
chunk 3: col2 IN (3,4)
then you will actually scan over the middle portion of that table 3 times.
If you sort the in-list before chunking you'll avoid the multiple-scan
effect here.
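The fix described above can be sketched as follows (a simplified model of the chunking, not the actual client code): sorting first makes each chunk cover a contiguous, disjoint key range.

```python
def chunk_in_list(values, n_chunks):
    # Sort first so each chunk covers a contiguous, disjoint key range;
    # chunking an unsorted in-list makes every chunk span the whole
    # value range, re-scanning the middle of the table for each chunk.
    vals = sorted(values)
    size = -(-len(vals) // n_chunks)  # ceiling division
    return [vals[i:i + size] for i in range(0, len(vals), size)]

print(chunk_in_list([1, 6, 2, 5, 3, 4], 3))  # [[1, 2], [3, 4], [5, 6]]
```

Compare with the unsorted chunks in the example above ([1,6], [2,5], [3,4]), each of which overlaps the others' key ranges.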
-Todd
ot the entire PK, it will only be
used on the read path when that actual column is selected, and it has the
same performance impact (positive or negative) as any other column in the
row.
-Todd
subjects that users can use in
> the future. Thanks.
>
> Regards,
>
>
n on the table like
> insert into the table, it throws an exception like: Could not find any valid
> location. Unknown host exception. Thanks in advance for your valuable time.
min_ratio (default 0.1). Raising
this would decrease the frequency of major delta compaction, but I
think there is likely something else going on here.
-Todd
> >
> > Can you give me some suggestions to optimize this performance problem?
Usually the best way to improve performance is by thinking carefully
about schema design, partitioning, and workload, rather than tuning
configuration. Maybe you can share more about your workload, schema,
and partitioning.
-Todd
; limitation "Maximum number of tablets per table for each tablet server is
> 60, post-replication"? Is it possible that this restriction will be removed?
>
See above.
-Todd
f replicas to 3 in order to have fault tolerance.
>
> XiaoNing: So if we want to have fault tolerance, we should at least set
> the replica number to be 3, right?
>
That's right.
-Todd
netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
>
>
>
> Thanks
>
> Rainerdun
>
t many nodes.
>
> Wouldn't it be useful here for Cliff's small dims to be partitioned into a
> couple tablets to similarly improve parallelism?
>
> -m
>
> On Fri, Mar 16, 2018 at 2:29 PM, Todd Lipcon wrote:
>
>> On Fri, Mar 16, 2018 at 2:19 PM, Cliff Resnick
re basic reason.
>
Impala could definitely be smarter, just a matter of programming
Kudu-specific join strategies into the optimizer. Today, the optimizer
isn't aware of the unique properties of Kudu scans vs other storage
mechanisms.
-Todd
>
> -Cliff
>
> On Fri, Ma
perhaps to sum things up, if nearly 100% of my metadata scans are single
> Primary Key lookups followed by a tiny broadcast then am I really just
> splitting hairs performance-wise between Kudu and HDFS-cached parquet?
>
> From: Todd Lipcon
> Reply-To: "user@kudu.apache.org"
du.
>>>> One Redshift feature that we will miss is its ALL Distribution, where a
>>>> copy of a table is maintained on each server. We define a number of
>>>> metadata tables this way since they are used in nearly every query. We are
>>>> considering using parquet in HDFS cache for these, and Kudu would be a much
>>>> better fit for the update semantics but we are worried about the additional
>>>> contention. I'm wondering if having a Broadcast, or ALL, tablet
>>>> replication might be an easy feature to add to Kudu?
>>>>
>>>> -Cliff
>>>>
>>>
>>>
>>
>
define a number of
>>> metadata tables this way since they are used in nearly every query. We are
>>> considering using parquet in HDFS cache for these, and Kudu would be a much
>>> better fit for the update semantics but we are worried about the additional
>>> contention. I'm wondering if having a Broadcast, or ALL, tablet
>>> replication might be an easy feature to add to Kudu?
>>>
>>> -Cliff
>>>
>>
>>
>
hroughput to drop.
This is tracked by KUDU-1693. I believe there was another JIRA somewhere
related as well, but can't seem to find it. Unfortunately fixing it is not
straightforward, though would have good impact for these cases where a
single writer is fanning out to tens or hundreds of tablets.
-Todd
What client are you using to benchmark? You might also be bound by the
client performance.
On Mar 11, 2018 2:04 PM, "Brock Noland" wrote:
> Hi,
>
> I'd verify that the new nodes are assigned tablets? Along with
> considering an increase the number of partitions on the table being tested.
>
> On
ables and as you can see from my example query,
> it is a really straight select from a table - no joins, no predicates and
> no complex calculations.
>
> Thanks again,
> Boris
>
> On Thu, Feb 22, 2018 at 2:44 PM, Todd Lipcon wrote:
>
>> In addition to what Hao suggests,
not seem to find a good strategy. The only thing that came
>> to my mind is to drop the production table and rename a staging table to
>> production table as the last step of the job, but in this case we are going
>> to lose statistics and security permissions.
>>
>> Any other ideas?
>>
>> Thanks!
>> Boris
>>
>
>
_id,
>>
>> CAST(nomen_string_flag as STRING) nomen_string_flag,
>>
>> src_event_id,
>>
>> CAST(last_utc_ts as BIGINT) last_utc_ts,
>>
>> device_free_txt,
>>
>> CAST(trait_bit_map as STRING) trait_bit_map,
>>
>> CAST(clu_subkey1_flag as STRING) clu_subkey1_flag,
>>
>> CAST(clinsig_updt_dt_tm as BIGINT) clinsig_updt_dt_tm,
>>
>> CAST(event_end_dt_tm as BIGINT) event_end_dt_tm,
>>
>> CAST(event_start_dt_tm as BIGINT) event_start_dt_tm,
>>
>> CAST(expiration_dt_tm as BIGINT) expiration_dt_tm,
>>
>> CAST(verified_dt_tm as BIGINT) verified_dt_tm,
>>
>> CAST(src_clinsig_updt_dt_tm as BIGINT) src_clinsig_updt_dt_tm,
>>
>> CAST(updt_dt_tm as BIGINT) updt_dt_tm,
>>
>> CAST(valid_from_dt_tm as BIGINT) valid_from_dt_tm,
>>
>> CAST(valid_until_dt_tm as BIGINT) valid_until_dt_tm,
>>
>> CAST(performed_dt_tm as BIGINT) performed_dt_tm,
>>
>> txn_id_text,
>>
>> CAST(ingest_dt_tm as BIGINT) ingest_dt_tm
>>
>> FROM v500.clinical_event
>>
>
>
ious to hear what approach you took.
-Todd
On Tue, Jan 30, 2018 at 11:08 PM, Pavel Martynov wrote:
> Ok, I found ticket https://issues.apache.org/jira/browse/KUDU-418, which
> fired at me.
>
moke
tests of Kudu on ~800 nodes before.
>
> Looking forward to your inputs on any organisation using Kudu where data
> volumes of more than 10 TB are ingested every day.
>
Hope some other users can chime in.
-Todd
9, 2018 at 2:22 PM, Todd Lipcon wrote:
>
>> On Mon, Jan 29, 2018 at 11:18 AM, Patrick Angeles
>> wrote:
>>
>>> Hi Boris.
>>>
>>> 1) I would like to bypass Impala as data for my bulk load coming from
>>>> sqoop and avro files are stored
table
>> SELECT * FROM some_csv_table does the trick.
>>
>> You can also use Kudu’s MapReduce OutputFormat to load data from HDFS,
>> HBase, or any other data store that has an InputFormat.
>>
>> No tool is provided to load data directly into Kudu’s on-disk data
>> format. We have found that for many workloads, the insert performance of
>> Kudu is comparable to bulk load performance of other systems.
>>
>
>
to new releases coming up!
>
> Boris
>
> On Fri, Jan 5, 2018 at 9:08 PM, Todd Lipcon wrote:
>
>> On Fri, Jan 5, 2018 at 5:50 PM, Boris Tyukin
>> wrote:
>>
>>> Hi Todd,
>>>
>>> thanks for your feedback! sure will be happy to update my po
microsecond
precision, so that's what Kudu implemented internally. With 64 bits there
is still enough range to store dates for 584,554 years at microsecond
precision.
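The range figure checks out arithmetically (a back-of-the-envelope verification, assuming an average Gregorian year; not from the original email):

```python
# 64-bit microsecond timestamps: total representable span in years.
us_per_year = 365.2425 * 86_400 * 1_000_000  # average Gregorian year in µs
years = 2**64 / us_per_year
assert int(years) == 584_554  # matches the figure quoted above
```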
I think
https://impala.apache.org/docs/build/html/topics/impala_timestamp.html has
some info about Kudu compatibility and limitations.
-Todd
ation whereas a string representation
would be a little clearer.
Thanks
-Todd
>
> On Fri, Jan 5, 2018 at 11:13 AM, Todd Lipcon wrote:
>
>> Oh, one other piece of feedback: maybe worth editing the title to say "vs
>> Apache Parquet" instead of "vs Apache
Oh, one other piece of feedback: maybe worth editing the title to say "vs
Apache Parquet" instead of "vs Apache Impala" since in all cases you are
using Impala as the query engine?
-Todd
On Fri, Jan 5, 2018 at 11:06 AM, Todd Lipcon wrote:
> Hey Boris,
>
> Thank
pers for such an amazing and much-needed product.
>
> Boris
>
>
>
uster. See https://kudu.apache.org/docs/command_line_tools_referenc
>>>>>> e.html#cluster-ksck for more details. For restarting a cluster, I
>>>>>> would recommend taking down all tablet servers at once, otherwise
>>>>>> tablet
>>>>>> replicas may try to replicate data from the server that was taken
>>>>>> down.
>>>>>>
>>>>>> Hope this helped,
>>>>>> Andrew
>>>>>>
>>>>>> On Tue, Dec 5, 2017 at 10:42 AM, Petter von Dolwitz (Hem) <
>>>>>> petter.von.dolw...@gmail.com> wrote:
>>>>>>
>>>>>> Hi Kudu users,
>>>>>>>
>>>>>>> We just started to use Kudu (1.4.0+cdh5.12.1). To make a baseline for
>>>>>>> evaluation we ingested 3 month worth of data. During ingestion we
>>>>>>> were
>>>>>>> facing messages from the maintenance threads that a soft memory
>>>>>>> limit were
>>>>>>> reached. It seems like the background maintenance threads stopped
>>>>>>> performing their tasks at this point in time. It also seems like
>>>>>>> the
>>>>>>> memory was never recovered even after stopping ingestion so I guess
>>>>>>> there
>>>>>>> was a large backlog being built up. I guess the root cause here is
>>>>>>> that we
>>>>>>> were a bit too conservative when giving Kudu memory. After a
>>>>>>> restart a
>>>>>>> lot of maintenance tasks were started (i.e. compaction).
>>>>>>>
>>>>>>> When we verified that all data was inserted we found that some data
>>>>>>> was missing. We added this missing data and on some chunks we got the
>>>>>>> information that all rows were already present, i.e impala says
>>>>>>> something
>>>>>>> like Modified: 0 rows, nnn errors. Doing the verification again
>>>>>>> now
>>>>>>> shows that the Kudu table is complete. So, even though we did not
>>>>>>> insert
>>>>>>> any data on some chunks, a count(*) operation over these chunks now
>>>>>>> returns
>>>>>>> a different value.
>>>>>>>
>>>>>>> Now to my question. Will data be inconsistent if we recycle Kudu
>>>>>>> after
>>>>>>> seeing soft memory limit warnings?
>>>>>>>
>>>>>>> Is there a way to tell when it is safe to restart Kudu to avoid these
>>>>>>> issues? Should we use any special procedure when restarting (e.g.
>>>>>>> only
>>>>>>> restart the tablet servers, only restart one tablet server at a time
>>>>>>> or
>>>>>>> something like that)?
>>>>>>>
>>>>>>> The table design uses 50 tablets per day (times 90 days). It is 8 TB
>>>>>>> of data after 3xreplication over 5 tablet servers.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Petter
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Andrew Wong
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Andrew Wong
>>>>>
>>>>>
>>>>
>>>>
>>>
>
> --
> David Alves
>
Hi Kudu community,
I'm pleased to announce that the Kudu PMC has voted to add Andrew Wong,
Grant Henke, and Hao Hao as Kudu committers and PMC members. This
announcement is a bit delayed, but I figured it's better late than never!
Andrew has contributed to Kudu in a bunch of areas. Most notably,
ht choose INT128 if available even if they have
no need for anywhere near that range.
-Todd
>
> On Thu, Nov 16, 2017 at 5:30 PM, Dan Burkert
> wrote:
>
> > Aren't we going to need efficient encodings in order to make decimal work
> > well, anyway?
> >
> > - Dan
d master lists are different: 10.15.213.10:7051 10.15.213.11:7051
> 10.15.213.12:7051 :0`
>
>
>
> It was the same on the other 2 master machine.
>
>
>
> I have no idea what's going on. Am I misunderstanding this configuration
> option?
>
> Best wishes.
>
>
>
> Liou Fongcyuan
>
ashes and other similar types
>> of data.
>>
>> Is there any interest or uses for a INT128 column type? Is anyone
>> currently using a STRING or BINARY column for 128 bit data?
>>
>> Thank you,
>> Grant
>> --
>> Grant Henke
>> Software Engineer | Cloudera
>> gr...@cloudera.com | twitter.com/gchenke | linkedin.com/in/granthenke
>>
>
>
es.apache.org/jira/browse/KUDU-1078, and the issue's status
> is reopened. I have uploaded logs for analyzing the issue; if you want more
> detail, just tell me 😄.
> log files:
> https://drive.google.com/open?id=1_1l2xpT3-NmumgI_sIdxch-6BocXqTCt
> https://drive.google.com/open?i
One thing you might try is to update the consensus rpc timeout to 30
seconds instead of 1. We changed the default in later versions.
I'd also recommend updating up 1.4 or 1.5 for other related fixes to
consensus stability. I think I recall you were on 1.3 still?
Todd
On Nov 3, 2017 7:47 PM, "Le
d the max.
> error.
>
> Franco
>
>
> --
> *From: *"Todd Lipcon"
> *To: *user@kudu.apache.org
> *Sent: *Wednesday, November 1, 2017 8:00:09 PM
> *Subject: *Re: Error message: 'Tried to update clock beyond the max.
> error.'
>
>
> What
and even time values up to 1000 seconds in the future (we read 1
billion nanoseconds as 1 billion microseconds (=1000 seconds)). I'll work
on reproducing this and a patch, to backport to previous versions.
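The misread described above can be illustrated numerically (an assumed sketch of the unit confusion, not the actual code path):

```python
# A clock field holding 1 billion nanoseconds (i.e. 1 second)...
ns_reading = 1_000_000_000

# ...mistakenly interpreted as microseconds:
misread_as_us = ns_reading
seconds_apparent = misread_as_us / 1_000_000
assert seconds_apparent == 1000.0  # appears ~1000 s in the future
```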
-Todd
On Wed, Nov 1, 2017 at 5:00 PM, Todd Lipcon wrote:
> What's the full l
What's the full log line where you're seeing this crash? Is it coming from
tablet_bootstrap.cc, raft_consensus.cc, or elsewhere?
-Todd
2017-11-01 15:45 GMT-07:00 Franco Venturi :
> Our version is kudu 1.5.0-cdh5.13.0.
>
> Franco
>
's a torture-test cluster of
sorts that is always way out of balance, re-replicating stuff, etc)
-Todd
>
>
>
> On Wed, Nov 1, 2017 at 1:40 PM, Todd Lipcon wrote:
>
>> On Wed, Nov 1, 2017 at 1:23 PM, Chao Sun wrote:
>>
>>> Thanks Todd! I improved my code
information about these background
> operations? I want to understand what happens in situations when some node
> is offline and then comes back up after a while. What is tablet
> initialization and bootstrapping, etc.
>
> --
> Br.
> Janne Keskitalo,
> Database Architect, PAF.COM
> For support: dbdsupp...@paf.com
>
>
kudu test loadgen" and may have fewer options
available.
-Todd
On Wed, Nov 1, 2017 at 12:23 AM, Todd Lipcon wrote:
>
>> On Wed, Nov 1, 2017 at 12:20 AM, Todd Lipcon wrote:
>>
>>> Sounds good.
>>>
>>> BTW, you can try a quick load test using the
On Wed, Nov 1, 2017 at 12:20 AM, Todd Lipcon wrote:
> Sounds good.
>
> BTW, you can try a quick load test using the 'kudu perf loadgen' tool.
> For example something like:
>
> kudu perf loadgen my-kudu-master.example.com --num-threads=8
> --num-rows-per-thread
threads=8
--num-rows-per-thread=100 --table-num-buckets=32
There are also a bunch of options to tune buffer sizes, flush options, etc.
But with the default settings above on an 8-node cluster I have, I was able
to insert 8M rows in 44 seconds (180k/sec).
Adding --buffer-size-bytes=1000 almos
ion on the UUID.
This should ensure that you get pretty good batching of the writes.
Todd
> On Tue, Oct 31, 2017 at 6:25 PM, Todd Lipcon wrote:
>
>> In addition to what Zhen suggests, I'm also curious how you are sizing
>> your batches in manual-flush mode? With 128 hash partiti
31 15:07 GMT+08:00 Chao Sun :
>
>> OK. Thanks! I changed to manual flush mode and it increased to ~15K /
>> sec. :)
>>
>> Is there any other tuning I can do to further improve this? and also, how
>> much would
>> SSD help in this case (only upsert)?
>>
>
e.newInsert();
> PartialRow row = insert.getRow();
> // fill the columns
> kuduSession.apply(insert)
> }
>
> I didn't specify the flushing mode, so will it pick up AUTO_FLUSH_SYNC
> as the default?
> Should I use MANUAL_FLUSH?
>
> Thanks,
> Chao
>
> On
Hey Chao,
Nice to hear you are checking out Kudu.
What are you using to consume from Kafka and write to Kudu? Is it possible
that it is Java code and you are using the SYNC flush mode? That would
result in a separate round trip for each record and thus very low
throughput.
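To make the comparison concrete, here is a minimal sketch of the batched alternative using the Java client's MANUAL_FLUSH mode. This is not your code: the master address, table name, and `id` column are made-up placeholders, and the API names are from the Kudu Java client as I recall them, so verify against the client version you use.

```java
import org.apache.kudu.client.*;

public class ManualFlushSketch {
    public static void main(String[] args) throws KuduException {
        // Hypothetical master address and table; replace with your own.
        KuduClient client =
            new KuduClient.KuduClientBuilder("kudu-master.example.com:7051").build();
        try {
            KuduTable table = client.openTable("my_table");
            KuduSession session = client.newSession();
            // MANUAL_FLUSH buffers operations client-side; AUTO_FLUSH_SYNC
            // (the default) does one network round trip per apply().
            session.setFlushMode(SessionConfiguration.FlushMode.MANUAL_FLUSH);

            for (int i = 0; i < 10_000; i++) {
                Insert insert = table.newInsert();
                PartialRow row = insert.getRow();
                row.addLong("id", i);   // hypothetical key column
                session.apply(insert);  // buffered, no round trip yet
                if (i % 1000 == 999) {
                    session.flush();    // one round trip per 1000 rows
                }
            }
            session.flush();            // flush any remaining buffered rows
            session.close();
        } finally {
            client.shutdown();
        }
    }
}
```

The point is simply that batching amortizes the per-RPC cost over many rows instead of paying it once per record.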
Todd
On Oct 30, 2017
t2, node3 t3 (t1 <
> t2 < t3), then a reading client attached to node1 can see the record, but
> clients reading from the other nodes (node2, node3) may miss record1.
> >
> > I think that does not happen in Kudu, and I wonder how Kudu synchronizes
> > real-time data.
> >
> > Thanks!
> >
>
>
On Tue, Oct 24, 2017 at 12:41 PM, Todd Lipcon wrote:
> I've filed https://issues.apache.org/jira/browse/KUDU-2198 to provide a
> workaround for systems like this. I should have a patch up shortly since
> it's relatively simple.
>
>
... and here's the patch, if
Mon, Oct 16, 2017 at 2:29 PM, Matteo Durighetto <
> m.durighe...@miriade.it> wrote:
> > the "abcdefgh1234" is an example of the string created by
> Cloudera Manager while enabling Kerberos.
>
> ...
>
> On Mon, Oct 16, 2017 at 11:57 PM, Todd
kudu on the ASF slack in case we decide to go
> forward with this. If we don't decide to go forward with it, it's a good
> idea to hold onto the channel and pin a message in there about how to get
> to the "official" Kudu slack.
>
> On Mon, Oct 23, 2017 at 3:00 PM,
typically the best choice for a streaming ingest
or bulk load scenario since it aims to manage buffer sizes for you
automatically for best performance. We'll continue to invest in making
AUTO_FLUSH_BACKGROUND work as well as possible for these scenarios.
-Todd
he official ASF slack (http://the-asf.slack.com/
> )
> and migrate our discussions there. What does everyone think?
>
Cannot find installed kudu client.
>>
>>
>> Command "python setup.py egg_info" failed with error code 1 in
>> c:\users\rani\app
>> data\local\temp\pip-build-7ildct\kudu-python
>>
>>
apping and instead just use the
"simple" mapping of using the short principal name? Generally we'd prefer
to have as simple a configuration as possible but if your configuration is
relatively commonplace it seems we might want an easier workaround than
duplicating krb5.conf.
-Todd
>
uot;
> c4ed5cb73f5644a8804d3abc976d02f8" member_type: VOTER last_known_addr {
> host: "cloud-ocean-kudu-02" port: 7050 } } peers { permanent_uuid: "
> 067e1e7245154f0fb2720dec6c77feec" member_type: VOTER last_known_addr {
> host: "cloud-ocean-kudu-04" port: 7050 } } } (1 of 76249 similar)
>
> 2017-09-06 14:04 GMT+08:00 Lee King :
>
>> We got an error: "Service unavailable: Transaction failed, tablet
>> 2758e5c68e974b92a3060db8575f3621 transaction memory consumption
>> (67031036) has exceeded its limit (67108864) or the limit of an ancestral
>> tracker." It looks like https://issues.apache.org/jira/browse/KUDU-1912,
>> and the bug will be fixed in 1.5, but our version is 1.4. Does this affect
>> Kudu stability or data consistency?
>>
>
>
the kerberos configuration, but in typical configurations it's
determined by the 'auth_to_local' configuration in your krb5.conf. See the
corresponding section in the docs here:
https://web.mit.edu/kerberos/krb5-1.12/doc/admin/conf_files/krb5_conf.html
My guess is that your host has been configured such that when the master
maps its own principal, it's getting a different result than when it maps
the principal being used by the tservers.
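For reference, a sketch of what such a mapping stanza looks like in krb5.conf. The realm and rule below are made-up examples, not your configuration; the rule syntax is documented at the MIT link above.

```ini
[realms]
    EXAMPLE.COM = {
        # Example rule: map a two-component principal such as
        # kudu/host.example.com@EXAMPLE.COM to its first component ("kudu").
        auth_to_local = RULE:[2:$1@$0](.*@EXAMPLE\.COM)s/@.*//
        auth_to_local = DEFAULT
    }
```

If the masters and tservers resolve principals through different rules (for example, different copies of krb5.conf), a mismatch like the one described above can result.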
Hope that gets you on the right track.
Thanks
-Todd
for synchronous writes to Kudu), but we would like to have some pretty good
>>>> confidence that the secondary instance contains all the changes that the
>>>> primary has up to say an hour before (or something like that).
>>>>
>>>>
>>>> So far we considered a couple of options:
>>>> - refreshing the secondary instance with a full copy of the primary one
>>>> every so often, but that would mean having to transfer say 50TB of data
>>>> between the two locations every time, and our network bandwidth constraints
>>>> would prevent us from doing that even on a daily basis
>>>> - having a column that contains the most recent time a row was updated,
>>>> however this column couldn't be part of the primary key (because the
>>>> primary key in Kudu is immutable), and therefore finding which rows have
>>>> been changed every time would require a full scan of the table to be
>>>> sync'd. It would also rely on the "last update timestamp" column to be
>>>> always updated by the application (an assumption that we would like to
>>>> avoid), and would need some other process to take into account the rows
>>>> that are deleted.
>>>>
>>>>
>>>> Since many of today's RDBMS (Oracle, MySQL, etc) allow for some sort of
>>>> 'Change Data Capture' mechanism where only the 'deltas' are captured and
>>>> applied to the secondary instance, we were wondering if there's any way in
>>>> Kudu to achieve something like that (possibly mining the WALs, since my
>>>> understanding is that each change gets applied to the WALs first).
>>>>
>>>>
>>>> Thanks,
>>>> Franco Venturi
>>>>
>>>
>>
>
>
Oops, adding the original poster in case he or she is not subscribed to the
list.
On Sep 19, 2017 10:46 PM, "Todd Lipcon" wrote:
> Hi Yuya,
>
> There should be no problem to use the Apache Kudu logo in your conference
> slides, assuming you are just using as intended to de