[jira] [Commented] (RANGER-1938) Solr for Audit setup doesn't use DocValues effectively

Kevin Risden (JIRA) Wed, 20 Dec 2017 07:17:15 -0800

    [ 
https://issues.apache.org/jira/browse/RANGER-1938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16298628#comment-16298628
 ]


Kevin Risden commented on RANGER-1938:
--------------------------------------

bq. If Apache Ambari is used to manage Ranger, then Ambari has its own template 
for solr-config

I think this only takes effect during the initial install when using SolrCloud. 
The template isn't pushed back up to ZooKeeper afterwards. I know my changes 
haven't been overwritten by Ambari (which I was a bit surprised by).

Does it make sense for me to open a JIRA against the Ambari project as well to 
fix this schema? I can link to this JIRA since they are related.

bq. If you are using Ambari-Infra Solr, then it is shared by Ambari LogSearch 
and Apache Atlas

It is true that Ambari Infra Solr is shared by Ranger, LogSearch, and Atlas. 
The modified Solr config applies only to a single collection, ranger_audit in 
this case. The same problem could also affect LogSearch and Atlas, and it would 
have to be fixed separately for each. We don't have LogSearch enabled right now, 
and our Atlas dataset is currently too small for us to notice a problem.

bq. We have to emphasize that changing the schema requires rebuilding of Solr 
Collection. Which means all existing audits from Solr will be deleted. There is 
an option to rebuild it from the audits stored in HDFS, but currently, there 
are no known documentation or scripts for that

Since the TTL is small, I recommend deleting because it avoids the complexity 
of reindexing from HDFS. If necessary, we definitely could reindex from HDFS, 
but as you said this would require more effort.
 
bq. I have given review comments at review board. Essentially, I would prefer 
to set the docValues at individual field level, rather than at global/default 
fieldType level.

I don't think there is a strong reason not to enable DocValues at the "global" 
fieldType level. We can disable it at the individual field level if necessary. 
There are very few cases where turning off DocValues makes sense, especially 
for Ranger. DocValues take up more disk space at index time, but the tradeoff 
is that Solr stays stable during sorting and faceting, since those operations 
no longer need to build large per-field structures on the heap. Most of the 
queries I've seen for Ranger use sorting in the Ranger Admin UI. The thing to 
remember here is that "global" means global to the collection, not global to 
Solr. Each collection can have a different configuration.
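
To illustrate a fieldType-level default with a per-field opt-out, here is a 
minimal sketch that builds Solr Schema API payloads in Python. It assumes the 
collection uses a managed schema. The field name exampleLargeField, the base 
URL, and the "string" field type definition are hypothetical and are not taken 
from the Ranger schema or from the attached patch. Nothing is sent to Solr; the 
payloads are only printed.

```python
import json

# Hypothetical definitions for illustration only; not the Ranger audit schema.
# "replace-field-type" needs the full field type definition, not just the delta.
field_type_default = {
    "replace-field-type": {
        "name": "string",
        "class": "solr.StrField",
        "sortMissingLast": True,
        "docValues": True,  # default for every field that uses this type
    }
}

# A single field can still opt out of the fieldType-level default.
field_override = {
    "replace-field": {
        "name": "exampleLargeField",  # hypothetical field name
        "type": "string",
        "indexed": True,
        "stored": True,
        "docValues": False,  # field-level setting overrides the fieldType
    }
}


def schema_url(base_url: str, collection: str) -> str:
    """Build the per-collection Schema API endpoint URL."""
    return f"{base_url.rstrip('/')}/{collection}/schema"


if __name__ == "__main__":
    # Base URL is an assumption; the schema is scoped to one collection.
    print(schema_url("http://localhost:8983/solr", "ranger_audit"))
    print(json.dumps(field_type_default, indent=2))
    print(json.dumps(field_override, indent=2))
```

The endpoint is per collection, which is the point above: a change made this 
way affects ranger_audit only. As already noted, existing documents still need 
to be reindexed (or the collection rebuilt) after a schema change like this.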

bq. I have a request. I know it is very difficult to recommend configuration 
settings for Solr. With your current setup, can you share the configuration you 
currently have?

1. Memory setting for Solr = *4GB (after the change for this JIRA; previously 
we had up to 24GB and Solr wasn't stable over time)*
2. Number of shards and replication = *5 shards - no replication right now 
(single Solr node)*
3. Number of days for TTL = *21 days*
4. Max number of documents (based on TTL) = *~1.33 billion live docs and ~110 
million deleted docs for 21 days. Split over 5 shards - each shard ~266 million 
live docs and ~22 million deleted docs right now.*
5. Are Solr instances running on dedicated servers or one of the master 
servers? = *1 Solr instance running on a data node with spinning disk (not 
using HDFS right now)*
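
The per-shard numbers above follow directly from the totals. A quick sketch of 
the arithmetic, where the daily ingest rate is my own rough derivation that 
assumes documents arrive evenly over the 21 day TTL (it was not measured):

```python
# Totals reported above (approximate).
live_docs = 1_330_000_000
deleted_docs = 110_000_000
shards = 5
ttl_days = 21

# Even split across shards.
live_per_shard = live_docs // shards
deleted_per_shard = deleted_docs // shards

# Assumption: uniform ingest over the TTL window.
docs_per_day = live_docs / ttl_days

# Share of the index occupied by deleted-but-not-yet-merged documents.
deleted_fraction = deleted_docs / (live_docs + deleted_docs)

if __name__ == "__main__":
    print(f"live docs per shard:    {live_per_shard:,}")
    print(f"deleted docs per shard: {deleted_per_shard:,}")
    print(f"approx docs per day:    {docs_per_day:,.0f}")
    print(f"deleted fraction:       {deleted_fraction:.1%}")
```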

Some other stats that could be helpful:
* very few queries against Ambari Infra Solr - Ranger Admin UI and Atlas UI 
(very rarely)
* default GC tuning from the Solr scripts is used
* current uptime is 2 weeks - no full GC pauses with 4GB heap
* average pause time is 0.00118s
* average pause interval is 15.41s
* average heap usage 2GB


> Solr for Audit setup doesn't use DocValues effectively
> ------------------------------------------------------
>
>                 Key: RANGER-1938
>                 URL: https://issues.apache.org/jira/browse/RANGER-1938
>             Project: Ranger
>          Issue Type: Bug
>          Components: audit
>    Affects Versions: 0.6.0, 0.7.0, 0.6.1, 0.6.2, 0.6.3, 0.7.1
>            Reporter: Kevin Risden
>            Assignee: Kevin Risden
>              Labels: performance
>             Fix For: 1.0.0, 0.7.2
>
>         Attachments: 
> 0001-RANGER-1938-Enable-DocValues-for-more-fields-in-Solr.patch
>
>
> Ranger uses Ambari Infra Solr (or another Apache Solr install) for storing 
> Ranger Audit events for displaying in Ranger Admin. In our case, we have 
> noticed quite a few Ambari Infra Solr OOM due to Ranger. I've talked with a 
> few other people who are having very similar problems with OOM errors.
> I've typed up some details about how the way Ranger is using Solr requires a 
> lot of heap. I've also outlined the fix for this which significantly reduced 
> the amount of heap memory required. I'm an Apache Lucene/Solr committer so 
> this optimization/usage might not be immediately obvious to those using Solr 
> especially version 5.x.
> https://risdenk.github.io/2017/12/18/ambari-infra-solr-ranger.html



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
