Comments inline...
Thanks,
Colton McInroy

 * Director of Security Engineering

        
Phone
(Toll Free)     
_US_    (888)-818-1344 Press 2
_UK_    0-800-635-0551 Press 2

My Extension    101
24/7 Support    [email protected] <mailto:[email protected]>
Email   [email protected] <mailto:[email protected]>
Website         http://www.dosarrest.com

On 9/27/2013 5:02 AM, Aaron McCurry wrote:
I have commented inline below:


On Thu, Sep 26, 2013 at 11:00 AM, Colton McInroy <[email protected]> wrote:
     I do have a few questions if you don't mind... I am still trying to
wrap my head around how this works. In my current implementation of a
logging system I create new indexes for each hour because I have a massive
amount of data coming in. I take in live log data from syslog and
parse/store it in hourly Lucene indexes along with a facet index. I want to
turn this into a distributed redundant system and Blur appears to be the
way to go. I tried Elasticsearch but it is just too slow compared to my
current implementation. Given that I take in gigs of raw log data an hour, I
need something that is robust and able to keep up with the inflow of data.

Given your current implementation of building up an index for an hour and
then making it available, I would use MapReduce for this:

http://incubator.apache.org/blur/docs/0.2.0/using-blur.html#map-reduce

That way all the shards in a table get a little more data each hour and
it's very low impact on the running cluster.
I am not sure I understand this. I would like data to be accessible live as it comes in, not wait an hour before I can query against it. I am also not sure where MapReduce comes in here. I thought MapReduce was something that Blur used internally.
     When taking in lots of data constantly, how is it recommended that it
be stored? I mentioned above that I create a new index for each hour to
keep data separated and quicker to search. If I want to look up a specific
time frame, I only have to load the directories timestamped with the hours
I want to look at. So instead of having to look at a huge index of, say, a
year's worth of data, I'm looking at a much smaller data set, which results
in faster query response times. Should a new table be created for each hour
of data? When I typed the create command into the shell, it took about
6 seconds to create a table. If I have to create a table for each
application each hour, this could create a lot of lag. Perhaps this is just
in my test environment though. Any thoughts on this? I also didn't see any
examples of how to create tables via code.

First off, Blur is designed to store very large amounts of data.  And while
it can do NRT updates like Solr and ES, its main focus is on bulk ingestion
through MapReduce.  Given that, the real limiting factor is how much
hardware you have.  Let's play out a scenario.  If you are adding 10GB of
data an hour, I would think a good rough ballpark guess is that you
will need 10-15% of the inbound data size as memory to make search perform
well.  However, as the index sizes increase this % may decrease over time.
Blur has an off-heap LRU cache to make accessing HDFS faster; however, if
you don't have enough memory the searches (and the cluster for that matter)
won't fail, they will simply become slower.
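
To make the ballpark above concrete, here is a minimal arithmetic sketch. It assumes the 10-15% figure applies to the total inbound data retained in the table; the function name and the retention parameter are hypothetical and only illustrate the rule of thumb, not any sizing tool shipped with Blur.

```python
def cache_memory_estimate_gb(inbound_gb_per_hour, retention_hours,
                             low=0.10, high=0.15):
    """Return a (low, high) memory estimate in GB.

    Assumption: the 10-15% rule of thumb is applied to all inbound
    data retained over `retention_hours`.
    """
    total_gb = inbound_gb_per_hour * retention_hours
    return total_gb * low, total_gb * high


if __name__ == "__main__":
    # 10 GB/hour kept for one day is 240 GB of inbound data.
    low_gb, high_gb = cache_memory_estimate_gb(10, 24)
    print(f"roughly {low_gb:.0f} to {high_gb:.0f} GB of memory")
```

As noted above, the percentage may drop as the index grows, so this is an upper-end planning figure rather than a requirement.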

So it's really a question of how much hardware you have.  If you end up
filling a table to the point where it no longer performs well on the cluster
you have, you might have to break it into pieces.  But I think that hourly is
too small a unit; consider daily, weekly, monthly, etc.
In my current system, which I designed and which uses just Lucene, we take in mainly web logs and separate them into indexes. Each web server gets its own index for each hour. Then when I need to query the data, I use a multi index reader to access the timeframe I need, allowing me to keep the size of the index down to roughly what I need to search. If data was stored over a month, and I want to query data that happened in just a single hour, or a few minutes, it makes sense to me to keep things optimized. Also, if I wanted to compare one web server to another, I would just use the multi index reader to load both indexes. This is all handled by a single server though, so it is limited by the hardware of that server. If something fails, it's a big problem. When trying to query large data sets it is, again, only a single server, so it takes longer than I would like if the index it's reading is large. I am not entirely sure how to go about doing this in Blur. I'm imagining that each "table" is an index. So I would have a table format like... YYYY_MM_DD_HH_IP. If I do this though, is there a way to query multiple tables... like a multi table reader or something? Or am I limited to looking at a single table at a time? For some web servers that have little traffic, an hour of data may only be a few MB, while others may have a 5-10GB index. If I combined the index from a large site with the small sites, this should make queries against the small sites' data slower, correct? Or would it all be the same due to how Blur separates indexes into shards? Would it perhaps be better to have an index for each web server, and configure small sites to have fewer shards while larger sites have more shards? We just got a new, really large, powerful server to be our log server, but as I realize that it's a single point of failure, I want to change our configuration to use a clustered/distributed configuration.
So we would start with probably a minimal configuration, and start adding more shard servers whenever we can afford it or need it.
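
The YYYY_MM_DD_HH_IP naming scheme described above can be sketched as follows. This is only an illustration of how a client might enumerate the hourly names that cover a time window; the function name is hypothetical, and replacing the dots in the IP with underscores is an assumption made here, not a documented Blur requirement. Whether Blur can query several tables in one request is not answered in this thread.

```python
from datetime import datetime, timedelta


def hourly_table_names(start, end, server_ip):
    """List YYYY_MM_DD_HH_IP style names for every hour touching [start, end]."""
    # Truncate to the top of the first hour so partial hours are included.
    current = start.replace(minute=0, second=0, microsecond=0)
    # Assumption: dots in the IP are swapped for underscores in the name.
    ip_part = server_ip.replace(".", "_")
    names = []
    while current <= end:
        names.append(f"{current:%Y_%m_%d_%H}_{ip_part}")
        current += timedelta(hours=1)
    return names


if __name__ == "__main__":
    window = hourly_table_names(
        datetime(2013, 9, 26, 22, 30),
        datetime(2013, 9, 27, 0, 15),
        "10.0.0.5",
    )
    for name in window:
        print(name)
```

A query for a short window then touches only the few names returned, which is the same idea as opening only the hourly Lucene directories that are needed.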
     Do shards contain the index data while the location (HDFS) contains
the documents (as Lucene refers to them)? I read that the shard
contains the index while the fs contains the data... I just wasn't quite
sure what the data was, because when I work with Lucene, the index
directory contains the data as documents.

The shard is stored in HDFS, and it is a Lucene index.  We store the data
inside the Lucene index, so it's basically Lucene all the way down to HDFS.
Ok, so basically a controller is a service which sends a distributed query to all (or some?) shards, telling each shard to run the query against a certain data set. That shard then gets that data set either from memory or from the Hadoop cluster, processes it, and returns the result to the controller, which condenses the results from all the queried shards into a final result, right?
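
The flow described in that question is a generic scatter-gather pattern, and can be sketched as below. This is not Blur's implementation: the in-memory dictionaries standing in for shards, the term-count scoring, and both function names are assumptions chosen only to show a coordinator fanning a query out and merging the per-shard top hits.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def search_shard(shard_docs, term, limit):
    """Hypothetical shard search: score each doc by how often the term occurs."""
    hits = [(text.count(term), doc_id)
            for doc_id, text in shard_docs.items() if term in text]
    # Each shard returns only its own best `limit` hits.
    return heapq.nlargest(limit, hits)


def controller_query(shards, term, limit):
    """Fan the query out to every shard, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        partials = list(pool.map(lambda s: search_shard(s, term, limit), shards))
    # The global top `limit` must be among the per-shard top `limit` lists.
    return heapq.nlargest(limit, (hit for part in partials for hit in part))


if __name__ == "__main__":
    shards = [
        {"a": "error error ok", "b": "ok"},
        {"c": "error"},
        {"d": "error error error"},
    ]
    for score, doc_id in controller_query(shards, "error", 2):
        print(doc_id, score)
```

Because every shard trims its results before replying, the coordinator only merges a small number of hits regardless of how much data each shard holds.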

Hope this helps.  Let us know if you have more questions.

Thanks,
Aaron



Thanks,
Colton McInroy


On 9/26/2013 6:56 AM, Aaron McCurry wrote:

Colton,

First off welcome, hope we can help you get started or at least past this
issue.  The binary artifact of Blur bundles Hadoop to try and get
people up and running as quickly as possible.  Obviously it has not done
its job in this case.  ;-)

Could you please post the shard and controller logs?  Also, are you running
Hadoop 1.x (CDH3) or Hadoop 2.x (CDH4)?  Or are you just trying to get it
to work with the internal embedded version of Hadoop?  Or all of the
above?  Also, have you set anything in the blur-env.sh file or is it the
vanilla version?  Just trying to cover all the bases here.  I just walked
through the getting started guide on CentOS 6.x, and last night I walked
through it on Ubuntu 12.03? (I think), the LTS version, with success.

I have found 2 issues with it, but neither seems related to the error you
are getting.

- One is an error with the default MaxDirectMemorySize setting in the
blur-env.sh, if left at the default it will throw an OOM exception when it
gets full (which shouldn't happen), but set larger it works as expected.
- The other is related to some status jsp pages in the daemon processes.

Let me know.

Thanks,
Aaron


On Thu, Sep 26, 2013 at 8:01 AM, Colton McInroy <[email protected]> wrote:
  Hello All,
      I am trying to get Blur operating in a virtual environment using
Gentoo as a base. I have compiled the latest version of Blur as well as
tried the 0.2.0 tagged version. Each time I try to create a table, I
get an IOException saying the table already exists. Here is a basic example
of what I have done...

hadoop@blur ~ $ tar xfz apache-blur-0.3.0-incubating-SNAPSHOT-bin.tar.gz
hadoop@blur ~ $ mv apache-blur-0.3.0-incubating-SNAPSHOT-bin blur

hadoop@blur ~ $ cd blur
hadoop@blur ~/blur $ bin/start-all.sh
localhost: ZooKeeper starting as process 9536.
localhost: Shard [0] starting as process 9598.
localhost: Controller [0] starting as process 9660.
hadoop@blur ~/blur $ bin/blur shell
blur (default)> create -t testtable -c 11 -l file:///tmp/testtable
java.io.IOException: Table [testtable] already exists.
blur (default)> enable testtable
java.io.IOException: Table [testtable] already enabled.
blur (default)> disable testtable
java.io.IOException: Table [testtable] already disabled.
blur (default)> remove testtable
blur (default)>

As one can see from the above commands, this is a fresh attempt at
starting blur as the instructions on the site provide. I have tried the
following versions of jdk with the same problem...

    [1]   icedtea-bin-6
    [2]   icedtea-bin-7
    [3]   oracle-jdk-bin-1.7
    [4]   sun-jdk-1.6

      Between attempts I made sure to delete /tmp/zk_data to avoid any
possible problems it was causing. Each time I would experience the same
problem when trying to create a table as a test. I tried this on an Ubuntu
VirtualBox VM and it worked. I don't understand why there would be a
difference between the two, considering I tried the same versions of Java.
The Ubuntu system was using OpenJDK 7, which is what Gentoo calls
icedtea-bin-7.

      When using debug and timed responses, here is what gets output when
typing the same commands...

blur (default)> debug
debugging is now on
blur (default)> timed
timing of commands is now on
Last command took 0ms
blur (default)> create -t testtable -c 11 -l file:///tmp/testtable
TableDescriptor(enabled:true, shardCount:11, tableUri:file:///tmp/testtable,
cluster:default, name:testtable, similarityClass:null, blockCaching:true,
blockCachingFileTypes:null, readOnly:false, preCacheCols:null,
tableProperties:null, strictTypes:false, defaultMissingFieldType:text,
defaultMissingFieldLessIndexing:true, defaultMissingFieldProps:null)

java.io.IOException: Table [testtable] already exists.
BlurException(message:java.io.IOException: Table [testtable] already exists., stackTraceStr:java.lang.RuntimeException: java.io.IOException: Table [testtable] already exists.
          at org.apache.blur.manager.clusterstatus.ZookeeperClusterStatus.createTable(ZookeeperClusterStatus.java:744)
          at org.apache.blur.thrift.TableAdmin.createTable(TableAdmin.java:101)
          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
          at java.lang.reflect.Method.invoke(Method.java:606)
          at org.apache.blur.utils.BlurUtil$1.invoke(BlurUtil.java:183)
          at com.sun.proxy.$Proxy0.createTable(Unknown Source)
          at org.apache.blur.thrift.generated.Blur$Processor$createTable.getResult(Blur.java:2402)
          at org.apache.blur.thrift.generated.Blur$Processor$createTable.getResult(Blur.java:2386)
          at org.apache.blur.thirdparty.thrift_0_9_0.ProcessFunction.process(ProcessFunction.java:54)
          at org.apache.blur.thirdparty.thrift_0_9_0.TBaseProcessor.process(TBaseProcessor.java:57)
          at org.apache.blur.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:515)
          at org.apache.blur.thrift.server.Invocation.run(Invocation.java:34)
          at org.apache.blur.concurrent.ThreadWatcher$ThreadWatcherExecutorService$1.run(ThreadWatcher.java:127)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
          at java.lang.Thread.run(Thread.java:724)
Caused by: java.io.IOException: Table [testtable] already exists.
          at org.apache.blur.manager.clusterstatus.ZookeeperClusterStatus.createTable(ZookeeperClusterStatus.java:722)
          ... 17 more
, errorType:UNKNOWN)
          at org.apache.blur.thrift.generated.Blur$createTable_result$createTable_resultStandardScheme.read(Blur.java:3818)
          at org.apache.blur.thrift.generated.Blur$createTable_result$createTable_resultStandardScheme.read(Blur.java:3804)
          at org.apache.blur.thrift.generated.Blur$createTable_result.read(Blur.java:3754)
          at org.apache.blur.thirdparty.thrift_0_9_0.TServiceClient.receiveBase(TServiceClient.java:78)
          at org.apache.blur.thrift.generated.Blur$Client.recv_createTable(Blur.java:458)
          at org.apache.blur.thrift.generated.Blur$Client.createTable(Blur.java:445)
          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
          at java.lang.reflect.Method.invoke(Method.java:606)
          at org.apache.blur.thrift.BlurClient$BlurClientInvocationHandler$1.call(BlurClient.java:59)
          at org.apache.blur.thrift.BlurClient$BlurClientInvocationHandler$1.call(BlurClient.java:55)
          at org.apache.blur.thrift.AbstractCommand.call(AbstractCommand.java:62)
          at org.apache.blur.thrift.BlurClientManager.execute(BlurClientManager.java:167)
          at org.apache.blur.thrift.BlurClient$BlurClientInvocationHandler.invoke(BlurClient.java:55)
          at com.sun.proxy.$Proxy0.createTable(Unknown Source)
          at org.apache.blur.shell.CreateTableCommand.doit(CreateTableCommand.java:100)
          at org.apache.blur.shell.Main.main(Main.java:471)

Last command took 61289ms
blur (default)> enable testtable
java.io.IOException: Table [testtable] already enabled.
BlurException(message:java.io.IOException: Table [testtable] already enabled., stackTraceStr:java.lang.RuntimeException: java.io.IOException: Table [testtable] already enabled.
          at org.apache.blur.manager.clusterstatus.ZookeeperClusterStatus.enableTable(ZookeeperClusterStatus.java:809)
          at org.apache.blur.thrift.TableAdmin.enableTable(TableAdmin.java:137)
          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
          at java.lang.reflect.Method.invoke(Method.java:606)
          at org.apache.blur.utils.BlurUtil$1.invoke(BlurUtil.java:183)
          at com.sun.proxy.$Proxy0.enableTable(Unknown Source)
          at org.apache.blur.thrift.generated.Blur$Processor$enableTable.getResult(Blur.java:2426)
          at org.apache.blur.thrift.generated.Blur$Processor$enableTable.getResult(Blur.java:2410)
          at org.apache.blur.thirdparty.thrift_0_9_0.ProcessFunction.process(ProcessFunction.java:54)
          at org.apache.blur.thirdparty.thrift_0_9_0.TBaseProcessor.process(TBaseProcessor.java:57)
          at org.apache.blur.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:515)
          at org.apache.blur.thrift.server.Invocation.run(Invocation.java:34)
          at org.apache.blur.concurrent.ThreadWatcher$ThreadWatcherExecutorService$1.run(ThreadWatcher.java:127)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
          at java.lang.Thread.run(Thread.java:724)
Caused by: java.io.IOException: Table [testtable] already enabled.
          at org.apache.blur.manager.clusterstatus.ZookeeperClusterStatus.enableTable(ZookeeperClusterStatus.java:805)
          ... 17 more
, errorType:UNKNOWN)
          at org.apache.blur.thrift.generated.Blur$enableTable_result$enableTable_resultStandardScheme.read(Blur.java:4540)
          at org.apache.blur.thrift.generated.Blur$enableTable_result$enableTable_resultStandardScheme.read(Blur.java:4526)
          at org.apache.blur.thrift.generated.Blur$enableTable_result.read(Blur.java:4476)
          at org.apache.blur.thirdparty.thrift_0_9_0.TServiceClient.receiveBase(TServiceClient.java:78)
          at org.apache.blur.thrift.generated.Blur$Client.recv_enableTable(Blur.java:481)
          at org.apache.blur.thrift.generated.Blur$Client.enableTable(Blur.java:468)
          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
          at java.lang.reflect.Method.invoke(Method.java:606)
          at org.apache.blur.thrift.BlurClient$BlurClientInvocationHandler$1.call(BlurClient.java:59)
          at org.apache.blur.thrift.BlurClient$BlurClientInvocationHandler$1.call(BlurClient.java:55)
          at org.apache.blur.thrift.AbstractCommand.call(AbstractCommand.java:62)
          at org.apache.blur.thrift.BlurClientManager.execute(BlurClientManager.java:167)
          at org.apache.blur.thrift.BlurClient$BlurClientInvocationHandler.invoke(BlurClient.java:55)
          at com.sun.proxy.$Proxy0.enableTable(Unknown Source)
          at org.apache.blur.shell.EnableTableCommand.doit(EnableTableCommand.java:37)
          at org.apache.blur.shell.Main.main(Main.java:471)

Last command took 3ms
blur (default)> disable testtable
java.io.IOException: Table [testtable] already disabled.
BlurException(message:java.io.IOException: Table [testtable] already disabled., stackTraceStr:java.lang.RuntimeException: java.io.IOException: Table [testtable] already disabled.
          at org.apache.blur.manager.clusterstatus.ZookeeperClusterStatus.disableTable(ZookeeperClusterStatus.java:784)
          at org.apache.blur.thrift.TableAdmin.disableTable(TableAdmin.java:120)
          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
          at java.lang.reflect.Method.invoke(Method.java:606)
          at org.apache.blur.utils.BlurUtil$1.invoke(BlurUtil.java:183)
          at com.sun.proxy.$Proxy0.disableTable(Unknown Source)
          at org.apache.blur.thrift.generated.Blur$Processor$disableTable.getResult(Blur.java:2450)
          at org.apache.blur.thrift.generated.Blur$Processor$disableTable.getResult(Blur.java:2434)
          at org.apache.blur.thirdparty.thrift_0_9_0.ProcessFunction.process(ProcessFunction.java:54)
          at org.apache.blur.thirdparty.thrift_0_9_0.TBaseProcessor.process(TBaseProcessor.java:57)
          at org.apache.blur.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:515)
          at org.apache.blur.thrift.server.Invocation.run(Invocation.java:34)
          at org.apache.blur.concurrent.ThreadWatcher$ThreadWatcherExecutorService$1.run(ThreadWatcher.java:127)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
          at java.lang.Thread.run(Thread.java:724)
Caused by: java.io.IOException: Table [testtable] already disabled.
          at org.apache.blur.manager.clusterstatus.ZookeeperClusterStatus.disableTable(ZookeeperClusterStatus.java:780)
          ... 17 more
, errorType:UNKNOWN)
          at org.apache.blur.thrift.generated.Blur$disableTable_result$disableTable_resultStandardScheme.read(Blur.java:5262)
          at org.apache.blur.thrift.generated.Blur$disableTable_result$disableTable_resultStandardScheme.read(Blur.java:5248)
          at org.apache.blur.thrift.generated.Blur$disableTable_result.read(Blur.java:5198)
          at org.apache.blur.thirdparty.thrift_0_9_0.TServiceClient.receiveBase(TServiceClient.java:78)
          at org.apache.blur.thrift.generated.Blur$Client.recv_disableTable(Blur.java:504)
          at org.apache.blur.thrift.generated.Blur$Client.disableTable(Blur.java:491)
          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
          at java.lang.reflect.Method.invoke(Method.java:606)
          at org.apache.blur.thrift.BlurClient$BlurClientInvocationHandler$1.call(BlurClient.java:59)
          at org.apache.blur.thrift.BlurClient$BlurClientInvocationHandler$1.call(BlurClient.java:55)
          at org.apache.blur.thrift.AbstractCommand.call(AbstractCommand.java:62)
          at org.apache.blur.thrift.BlurClientManager.execute(BlurClientManager.java:167)
          at org.apache.blur.thrift.BlurClient$BlurClientInvocationHandler.invoke(BlurClient.java:55)
          at com.sun.proxy.$Proxy0.disableTable(Unknown Source)
          at org.apache.blur.shell.DisableTableCommand.doit(DisableTableCommand.java:35)
          at org.apache.blur.shell.Main.main(Main.java:471)

Last command took 61285ms
blur (default)> remove testtable
Last command took 38ms
blur (default)>

      I emailed Aaron directly who suggested I make sure I clear the
/tmp/zk_data directory, and also requested that I submit this to the
mailing list, so here it is.

      Any help in resolving this would be greatly appreciated.

--
Thanks,
Colton McInroy




