Filtering group query results

2018-10-04 Thread Greenhorn Techie
Hi,

We have a requirement where we need to perform a group query in Solr where
results are grouped by user-name (which is a field in our indexes). We
then need to filter the results based on numFound response parameter
present under each group. In essence, we want to return results only where
numFound=1.
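
For reference, a minimal sketch of the kind of query we are running (the
collection name "mycoll" is hypothetical):

    curl "http://localhost:8983/solr/mycoll/select?q=*:*&group=true&group.field=user-name&group.limit=1&wt=json"

Each group in the grouped response carries its own numFound; that per-group
count is the value we want to filter on.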

Looking into the documentation, I couldn’t figure out any mechanism to
achieve this. So wondering if there is a possibility to achieve this
requirement with the existing building blocks of Solr query mechanism.

Thanks


Re: Multiple Queries per request

2018-10-02 Thread Greenhorn Techie
Shamik,

Wondering how to get this working? As I mentioned, my data is different for
each of the widgets. So I am not sure how to "return all the necessary data
at one shot and group them".

Any particular inputs?

Thanks


On 2 October 2018 at 15:47:50, Shamik Sinha (shamikchand...@gmail.com)
wrote:

Solr uses REST-based calls over HTTP or HTTPS, which cannot handle multiple
requests in one shot. However, what you can do is return all the necessary
data in one shot and group it according to your needs.
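
For example, if both widgets can be expressed as facets over the same result
set, something like the JSON Facet API might let you fetch both in one
request; a hedged sketch (collection and field names are hypothetical):

    # one request, two facet blocks: product usage and activity by hour
    curl 'http://localhost:8983/solr/mycoll/query' -d '{
      "query": "customer_id:12345",
      "limit": 0,
      "facet": {
        "product_usage": { "type": "terms", "field": "product" },
        "active_hours":  { "type": "terms", "field": "hour_of_day" }
      }
    }'
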
Thanks and regards,
Shamik


On 02-Oct-2018 8:11 PM, "Greenhorn Techie" 
wrote:

Hi,

We are building a mobile app which would display results from Solr. At the
moment, the idea is to have multiple widgets / areas on the mobile screen,
with each area being served by a distinct Solr query. For example, the first
widget would display the customer’s aggregated product usage, and the second
widget would display the time windows during which they are more active on
the app.

As these two widgets have different field lists and query parameters, I was
wondering whether I can make a single call into Solr, which would then
return the results for each widget separately. I have gone through the mail
archive, but could not determine whether this is possible in Solr.

Any thoughts from the awesome community?

Thanks


Multiple Queries per request

2018-10-02 Thread Greenhorn Techie
Hi,

We are building a mobile app which would display results from Solr. At the
moment, the idea is to have multiple widgets / areas on the mobile screen,
with each area being served by a distinct Solr query. For example, the first
widget would display the customer’s aggregated product usage, and the second
widget would display the time windows during which they are more active on
the app.

As these two widgets have different field lists and query parameters, I was
wondering whether I can make a single call into Solr, which would then
return the results for each widget separately. I have gone through the mail
archive, but could not determine whether this is possible in Solr.

Any thoughts from the awesome community?

Thanks


Metrics for a healthy Solr cluster

2018-08-16 Thread Greenhorn Techie
Hi,

Solr provides numerous JMX metrics for monitoring the health of the
cluster. We are setting up a SolrCloud cluster and hence are wondering
which parameters / metrics are important to look into, to ascertain that
the cluster health is good. The obvious things that come to mind are CPU
utilisation and memory utilisation.

However, what other parameters should we look into to assess the health of
the cluster? Are there any best practices?
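
For what it's worth, a minimal sketch of pulling JVM- and node-level metrics
over HTTP instead of JMX, assuming Solr 6.4 or later where the Metrics API
is available:

    # fetch the JVM and node metric groups as JSON
    curl "http://localhost:8983/solr/admin/metrics?group=jvm,node&wt=json"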

Thanks


Re: Calculating maxShardsPerNode

2018-08-13 Thread Greenhorn Techie
Thanks Erick!


On 13 August 2018 at 17:05:57, Erick Erickson (erickerick...@gmail.com)
wrote:

I wouldn't spend a lot of time worrying about this, just set it to a
big number ;).

Solr won't create that many replicas (and the name is a bit confusing)
unless you ask it to. It'll also distribute the replicas across nodes
rather than putting them all together, and in the worst case you can
place them individually yourself.

This is more a safety valve to keep from doing ridiculous things, like
telling Solr to create 1,000 replicas on 2 hosts or something.

Best,
Erick

On Mon, Aug 13, 2018 at 8:37 AM, Greenhorn Techie
 wrote:
> Hi,
>
> Our cluster is a 20-node one, with numShards expected to be set to 10 and
> replicationFactor expected to be 4. Wondering what is the best value to
> set maxShardsPerNode to? Should I consider only numShards while calculating
> the value, i.e. because I have only 10 shards, should I set maxShardsPerNode
> to 1, or the number of physical replicas, i.e. numShards * replicationFactor?
> So should I set maxShardsPerNode to 2 because the total physical replicas
> are 40 (numShards * replicationFactor) while the number of nodes is only
> 20?
>
> Please let me know your thoughts.
>
> Thanks


Calculating maxShardsPerNode

2018-08-13 Thread Greenhorn Techie
Hi,

Our cluster is a 20-node one, with numShards expected to be set to 10 and
replicationFactor expected to be 4. Wondering what is the best value to
set maxShardsPerNode to? Should I consider only numShards while calculating
the value, i.e. because I have only 10 shards, should I set maxShardsPerNode
to 1, or the number of physical replicas, i.e. numShards * replicationFactor?
So should I set maxShardsPerNode to 2 because the total physical replicas
are 40 (numShards * replicationFactor) while the number of nodes is only
20?
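
For reference, a hedged sketch of the Collections API call with these
numbers, i.e. 40 physical replicas spread over 20 nodes (the collection
name "mycoll" is hypothetical):

    curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=mycoll&numShards=10&replicationFactor=4&maxShardsPerNode=2"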

Please let me know your thoughts.

Thanks


Re: CDCR traffic

2018-07-09 Thread Greenhorn Techie
Amrit,

Further to the below conversation:

As I understand, Solr supports SSL encryption between nodes within a Solr
cluster and as well communications to and from clients. In the case of
CDCR, assuming both the source and target clusters are SSL enabled, can we
say that the source clusters’ shard leaders act as clients to the target
cluster and hence the data is encrypted while it's transmitted between the
clusters?

Thanks


On 25 June 2018 at 15:56:07, Amrit Sarkar (sarkaramr...@gmail.com) wrote:

Hi Rajeswari,

No it is not. Source forwards the update to the Target in classic manner.

Amrit Sarkar
Search Engineer
Lucidworks, Inc.
415-589-9269
www.lucidworks.com
Twitter http://twitter.com/lucidworks
LinkedIn: https://www.linkedin.com/in/sarkaramrit2
Medium: https://medium.com/@sarkaramrit2

On Fri, Jun 22, 2018 at 11:38 PM, Natarajan, Rajeswari <
rajeswari.natara...@sap.com> wrote:

> Hi,
>
> Would like to know , if the CDCR traffic is encrypted.
>
> Thanks
> Ra
>


Re: Solr Kerberos Authentication

2018-07-09 Thread Greenhorn Techie
Hi,

Any thoughts on this please?

Thanks


On 5 July 2018 at 15:06:26, Greenhorn Techie (greenhorntec...@gmail.com)
wrote:

Hi,

In the Solr documentation, it is mentioned that the blockUnknown property for
the authentication plugin has the default value of false, which means even
unauthenticated users will be allowed to use Solr. However, wondering whether
this parameter makes sense for Basic Authentication only, or does it
impact Kerberos authentication as well?

I couldn’t find any Kerberos plugin example in the documentation where the
blockUnknown parameter has been set or defined. Hence my question.

Thanks


Re: Unable to Create a Core

2018-07-06 Thread Greenhorn Techie
Erick,

Good Evening!!

A question further on the below. If schema-oriented design is recommended
for production systems, how should we design production systems to cater
for inevitable schema changes? Should we reindex the data and rebuild the
collections each time?

Thanks


On 6 July 2018 at 15:50:57, Erick Erickson (erickerick...@gmail.com) wrote:

"df" is not defaultFieldType, it's the default search field.

It looks like you're using "schemaless" mode. defaultFieldType is part
of that process.

Incidentally, we don't recommend schemaless for production systems as
it has to make assumptions about how you want to search. It's fine for
getting started, but for production it's usually better to take
control of your schema explicitly.

Best,
Erick

On Fri, Jul 6, 2018 at 6:41 AM, neotorand  wrote:
> Hi List,
> I am unable to create a core. Unable to figure out what's wrong.
> I get below error.
>
> ERROR: Failed to create collection 'XXX' due to:
> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error
> from server at http://xyz.com:8983/solr:
> Error CREATEing SolrCore 'docpocc_shard1_replica1':
> Unable to create core [docpocc_shard1_replica1] Caused by: Missing required
> init param 'defaultFieldType'
>
> In my solrconfig.xml I have the init params as below:
>
> <initParams path="/update/**,/query,/select,/tvrh,/elevate,/spell,/browse">
>   <lst name="defaults">
>     <str name="df">_text_</str>
>   </lst>
> </initParams>
>
> Any help or pointers? Thanks in advance.
>
>
> Regards
> Neo
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Solr Kerberos Authentication

2018-07-05 Thread Greenhorn Techie
Hi,

In the Solr documentation, it is mentioned that the blockUnknown property for
the authentication plugin has the default value of false, which means even
unauthenticated users will be allowed to use Solr. However, wondering whether
this parameter makes sense for Basic Authentication only, or does it
impact Kerberos authentication as well?

I couldn’t find any Kerberos plugin example in the documentation where the
blockUnknown parameter has been set or defined. Hence my question.
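
For discussion, a minimal sketch of where the flag would sit, assuming the
Kerberos plugin honors it the same way the Basic plugin does (which is
exactly what I am unsure about); the ZooKeeper host is hypothetical:

    # security.json with blockUnknown set explicitly
    {
      "authentication": { "class": "solr.KerberosPlugin", "blockUnknown": true }
    }

    # upload it to ZooKeeper
    server/scripts/cloud-scripts/zkcli.sh -zkhost localhost:2181 \
      -cmd putfile /security.json security.json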

Thanks


CloudSolrClient - setDefaultCollection

2018-06-21 Thread Greenhorn Techie
Hi,

While indexing, is there going to be any performance benefit in setting the
collection name first using the setDefaultCollection(String collection)
method and then indexing the document using cloudClient.add(solrInputDoc),
instead of using the method cloudClient.add(collectionName, solrInputDoc)?

Is there a real performance benefit to consider, or is this merely a matter
of convenience / better-looking code?

Thanks


Re: Solr start script

2018-06-07 Thread Greenhorn Techie
Shawn, Thanks for your response. Please find my follow-up questions:

1. My understanding is that directory factory settings are typically at a
collection / core level. If that's the case, what is the advantage of
passing them along with the start script?
2. In your below response, did you mean that even though I pass the
settings as part of the start script, they don't have any effect unless
they are referenced in the solrconfig.xml file?
3. As per my previous email, what does Solr do if my solrconfig.xml
contains the NRTDirectoryFactory setting while the Solr script is started
with HDFS settings?

Thanks


On 7 June 2018 at 15:08:02, Shawn Heisey (apa...@elyograg.org) wrote:

On 6/7/2018 7:37 AM, Greenhorn Techie wrote:
> When the above settings are passed as part of the start script, does that
> mean whenever a new collection is created, Solr is going to store the
> indexes in HDFS? But what if I upload my solrconfig.xml to ZK which
> contradicts this and contains the NRTDirectoryFactory setting? Given the
> above start script, should / could I skip the directory factory setting
> section in my solrconfig.xml with the assumption that the collections are
> going to be stored on HDFS *by default*?

Those commandline options are Java system properties.  It looks like the
example configs DO have settings in them that would use the
solr.directoryFactory and solr.lock.type properties.  But if your
solrconfig.xml file doesn't reference those properties, then they
wouldn't make any difference.  The last one is probably a setting that
HdfsDirectoryFactory uses that doesn't need to be explicitly referenced
in a config file.

Thanks,
Shawn


Re: HDP Search - Configuration & Data Directories

2018-06-07 Thread Greenhorn Techie
Thanks Shawn. Will check with Hortonworks!


On 7 June 2018 at 14:19:43, Shawn Heisey (apa...@elyograg.org) wrote:

On 6/7/2018 6:35 AM, Greenhorn Techie wrote:
> A quick question on configuring Solr with Hortonworks HDP. I have
> installed HDP and then installed HDP Search using the steps described
> under the link



> - Within the various Solr config settings on Ambari, I am a bit confused
> on the role of the "solr_config_conf_dir" parameter. At the moment, it only
> contains the log4j.properties file. As HDPSearch is mainly meant to be used
> with SolrCloud, wondering what the significance of this directory is, as
> the configurations are always maintained on ZooKeeper.

The text strings "solr_config_conf_dir" and "solr_config_data_dir" do
not appear anywhere in the Lucene/Solr source code, even if I use a
case-insensitive grep, which must mean that they are specific to the
third-party software you are using. You'll need to ask your question of
the people who make that third-party software.

The log4j config is not in ZooKeeper. That will be found on each
server. That file configures the logging framework at the JVM level; it
is not specifically for Solr.

Thanks,
Shawn


Solr start script

2018-06-07 Thread Greenhorn Techie
Hi,

For our project purposes, we need to store Solr collections on HDFS. While
exploring the documentation for this, I found the Lucidworks documentation (
https://doc.lucidworks.com/lucidworks-hdpsearch/3.0.0/Guide-Install-Manual.html#hdfs-specific-changes),
where it is mentioned that the Solr start script can be passed many
arguments at startup. The example provided is as below:

bin/solr start -c
   -z 10.0.0.1:2181,10.0.0.2:2181,10.0.0.3:2181/solr
   -Dsolr.directoryFactory=HdfsDirectoryFactory
   -Dsolr.lock.type=hdfs
   -Dsolr.hdfs.home=hdfs://sandbox.hortonworks.com:8020/user/solr


What does it actually mean to pass directoryFactory settings to the
Solr start script? I was thinking the directory factory setting is something
that applies only at the collection level, i.e. something we need to specify
within the solrconfig.xml file *only*.

When the above settings are passed as part of start script, does that mean
whenever a new collection is created, Solr is going to store the indexes in
HDFS? But what if I upload my solrconfig.xml to ZK which contradicts with
this and contains NRTDirectoryFactory setting? Given the above start
script, should / could I skip the directory factory setting section in my
solrconfig.xml with the assumption that the collections are going to be
stored on HDFS *by default*?

This is confusing to me and hence need the expert advice of the community.

Thanks


Running Solr on HDFS - Disk space

2018-06-07 Thread Greenhorn Techie
Hi,

As HDFS has its own replication mechanism, with an HDFS replication
factor of 3 and a SolrCloud replication factor of 3, does that mean
each document will end up with around 9 copies stored underneath in
HDFS? If so, is there a way to configure HDFS or Solr such that only three
copies are maintained overall?

Thanks


HDP Search - Configuration & Data Directories

2018-06-07 Thread Greenhorn Techie
Hi,

A quick question on configuring Solr with Hortonworks HDP. I have installed
HDP and then installed HDP Search using the steps described under the link
-
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.4/bk_solr-search-installation/content/hdp-search30-install-mpack.html


I have used the link from lucidworks to configure various parameters
exposed on Ambari -
https://doc.lucidworks.com/lucidworks-hdpsearch/3.0.0/Guide-Install-Ambari.html#_startup-option-reference


   - Within the various Solr config settings on Ambari, I am a bit confused
   on the role of "solr_config_conf_dir" parameter. At the moment, it only
   contains log4j.properties file. As HDPSearch is mainly meant to be used
   with SolrCloud, wondering what is the significance of this directory as the
   configurations are always maintained on ZooKeeper.
   - Another question is when the indexes for SolrCloud collections are
   stored on HDFS, what is the significance of "solr_config_data_dir"? Is
   the solr_config_data_dir directory used ONLY for collections for which
   directory factory settings are set to local? If so, is it safe to assume
   that this is not needed when HDFS is being used?

Thanks


Re: SolrCloud Collection Backup - Solr 5.5.4

2018-06-04 Thread Greenhorn Techie
Thanks Shawn for your detailed reply. It has helped to improve my
understanding. Below is my summarised understanding.

In a SolrCloud setup with a version less than 6.1, there is no ‘elegant’ way
of handling collection backups and restores. Instead, one has to use the
manual backup and restore APIs of the replication handler. However, as these
APIs were primarily designed for standalone Solr installations, they can only
back up data stored on a single Solr host for a particular core. Hence, in
order to get the complete collection data backed up for a SolrCloud
collection, the backup API should be invoked on every node belonging to the
SolrCloud cluster, and the ZooKeeper cluster state then backed up manually,
with possible tweaking needed to ensure hash value consistency.

Few follow-up questions:
1. In the SolrCloud, as a single host can have information about multiple
shards (either leader or replica), how does the backup API handle the
underlying data copy? I presume it will simply copy the data across ALL the
shards (both leader and replicas) for the specified collection.
2. If I am invoking the backup command periodically to backup the data and
then invoke restore command later (possibly due to cluster shutdown and
create a fresh SolrCloud cluster), I presume I don't need to tinker with
the hash values as long as the default settings have been used in both
backup and restore situations?

Thanks


On 2 June 2018 at 08:59:26, Shawn Heisey (apa...@elyograg.org) wrote:

On 6/2/2018 1:50 AM, Shawn Heisey wrote:
> If you provide a location parameter, it will write a new backup
> directory in that location.
>
>
> https://lucene.apache.org/solr/guide/6_6/making-and-restoring-backups.html#standalone-mode-backups
>
> I verified that this parameter is in the 5.5 docs too, I would suggest
> you download that version in PDF format if you want a full reference.

A followup:

I suspect that if you try to use the restore functionality on the
replication handler and have multiple shard replicas, that SolrCloud
would not replicate things properly.  I could be wrong about that, but I
think that restoring from replication handler backups to SolrCloud could
get a little messy.
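
As an aside, a hedged sketch of the backup call with an explicit location
(the path and snapshot name are hypothetical):

    curl "http://localhost:8983/solr/gettingstarted/replication?command=backup&location=/data/solr-backups&name=nightly"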

Thanks,
Shawn


SolrCloud Collection Backup - Solr 5.5.4

2018-06-01 Thread Greenhorn Techie
Hi,

We are running SolrCloud with version 5.5.4. As I understand, the Solr
collection Backup and Restore APIs are only supported from version 6
onwards. So I am wondering what the best mechanism is to get our collections
backed up on an older Solr version.

When I ran the backup command on a particular node (curl
http://localhost:8983/solr/gettingstarted/replication?command=backup) it
seems it only creates a snapshot of the collection data stored on that
particular node. Does that mean, if I run this command for every node
hosting my SolrCloud collection, I will be getting the required backup?
Will this back up the metadata from ZK as well? I presume not. If so, what
are the best possible approaches to get it? Is there something made
available by Solr for this?

Thanks


Impact of timeAllowed parameter

2018-05-31 Thread Greenhorn Techie
Hi,

Wondering how the calling application would be informed that the search
request has been impacted by the time-out vs. having completed normally. Is
there something sent to the client as part of the response indicating that
the time-out kicked in?
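
From what I can tell so far, when timeAllowed kicks in, Solr flags the
response header with partialResults=true; a minimal sketch (the collection
name is hypothetical):

    # cap the search at 100 ms; a timed-out response carries
    # "partialResults": true in its responseHeader
    curl "http://localhost:8983/solr/mycoll/select?q=*:*&timeAllowed=100&wt=json"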

Thanks


Re: Navigating through Solr Source Code

2018-05-21 Thread Greenhorn Techie
Thanks for your responses.

Best Regards!


On 21 May 2018 at 16:40:10, Shawn Heisey (apa...@elyograg.org) wrote:

On 5/21/2018 4:35 AM, Greenhorn Techie wrote:
> As the documentation around Solr is limited, I am thinking of going through
> the source code to understand the various bits and pieces. However, I am a
> bit confused about where to start, as my development skills are a bit
> limited.
>
> Any thoughts on how best to start / where to start looking into the Solr
> source code?

As Erick has said, the rabbit hole is very deep. I've been looking into
it for a few years now. There are parts of it that are a complete mystery.

Depending on exactly what you're looking for, one approach is to examine
the SolrDispatchFilter class.  This is the entry point from the servlet
container for most HTTP requests, and a lot of Solr's startup
initialization is found there.

The solr/webapp/web/WEB-INF/web.xml file in the source code checkout is
what loads SolrDispatchFilter and a few other classes when Solr starts.

Thanks,
Shawn


Navigating through Solr Source Code

2018-05-21 Thread Greenhorn Techie
Hi,

As the documentation around Solr is limited, I am thinking of going through
the source code to understand the various bits and pieces. However, I am a
bit confused about where to start, as my development skills are a bit limited.

Any thoughts on how best to start / where to start looking into the Solr
source code?

Thanks


Re: SolrCloud replicaition

2018-05-03 Thread Greenhorn Techie
Perfect! Thanks Erick and Shalin!!


On 3 May 2018 at 16:13:06, Erick Erickson (erickerick...@gmail.com) wrote:

Shalin's right, I was hurried in my response and forgot that
min_rf just _allows_ the client to figure out that the update didn't
get applied on enough replicas and that the client has to "do the right
thing" with that information. Thanks, Shalin!

Right, your scenario is correct. When the follower goes back to
"active" and starts serving queries it will be all caught up with the
leader, including any missed documents.

Your step 4, the client gets a success response since the document was
indexed successfully on the leader. There's some additional
information in the response saying min_rf wasn't met and you should do
whatever you think appropriate. Stop indexing, retry, send a message
to your sysadmin, etc.

You can figure out exactly what by a pretty simple experiment, just
take one replica of a two-replica system down and specify min_rf of
2.

Best,
Erick

On Wed, May 2, 2018 at 9:20 PM, Greenhorn Techie
<greenhorntec...@gmail.com> wrote:
> Shalin,
>
> Given the earlier response by Erick, wondering when this scenario occurs
> i.e. when the replica node recovers after a time period, wouldn’t it
> automatically recover all the missed updates by connecting to the leader?
> My understanding is the below from the responses so far (assuming
> replication factor of 2 for simplicity purposes):
>
> 1. Client tries an update request which is received by the shard leader
> 2. The leader, once it has updated its own node, sends the update to the
> unavailable replica node
> 3. The leader keeps trying to send the update to the replica node
> 4. After a while the leader gives up and communicates this to the client
> (not sure what kind of message the client will receive in this case?)
> 5. The replica node recovers and then realises that it needs to catch up
> and hence receives all the updates in recovery mode
>
> Correct me if I am wrong in my understanding.
>
> Thnx!!
>
>
> On 3 May 2018 at 04:10:12, Shalin Shekhar Mangar (shalinman...@gmail.com)
> wrote:
>
> The min_rf parameter does not fail indexing. It only tells you how many
> replicas received the live update. So if the value is less than what you
> wanted then it is up to you to retry the update later.
>
On Wed, May 2, 2018 at 3:33 PM, Greenhorn Techie
<greenhorntec...@gmail.com> wrote:
>
>> Hi,
>>
>> Good Morning!!
>>
>> In the case of a SolrCloud setup with sharding and replication in place,
>> when a document is sent for indexing, what happens when only the shard
>> leader has indexed the document, but the replicas failed, for whatever
>> reason? Will the document be resent by the leader to the replica shards to
>> index the document after some time, or how is this scenario addressed?
>>
>> Also, given the above context, when I set the value of the min_rf parameter
>> to say 2, does that mean the calling application will be informed that the
>> indexing failed?
>>
>
>
>
> --
> Regards,
> Shalin Shekhar Mangar.


Re: SolrCloud replicaition

2018-05-02 Thread Greenhorn Techie
Shalin,

Given the earlier response by Erick, wondering when this scenario occurs
i.e. when the replica node recovers after a time period, wouldn’t it
automatically recover all the missed updates by connecting to the leader?
My understanding is the below from the responses so far (assuming
replication factor of 2 for simplicity purposes):

1. Client tries an update request which is received by the shard leader
2. The leader, once it has updated its own node, sends the update to the
unavailable replica node
3. The leader keeps trying to send the update to the replica node
4. After a while the leader gives up and communicates this to the client
(not sure what kind of message the client will receive in this case?)
5. The replica node recovers and then realises that it needs to catch up
and hence receives all the updates in recovery mode

Correct me if I am wrong in my understanding.

Thnx!!
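
For step 4, a hedged sketch of how I understand the min_rf handshake works
(the collection name and document are hypothetical): the achieved
replication factor comes back as "rf" in the response, and it is up to the
client to compare it against the requested minimum.

    # ask Solr to report whether at least 2 replicas received the update
    curl -X POST "http://localhost:8983/solr/mycoll/update?min_rf=2&commit=true" \
      -H 'Content-Type: application/json' -d '[{"id":"doc1"}]'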


On 3 May 2018 at 04:10:12, Shalin Shekhar Mangar (shalinman...@gmail.com)
wrote:

The min_rf parameter does not fail indexing. It only tells you how many
replicas received the live update. So if the value is less than what you
wanted then it is up to you to retry the update later.

On Wed, May 2, 2018 at 3:33 PM, Greenhorn Techie <greenhorntec...@gmail.com>
wrote:

> Hi,
>
> Good Morning!!
>
> In the case of a SolrCloud setup with sharding and replication in place,
> when a document is sent for indexing, what happens when only the shard
> leader has indexed the document, but the replicas failed, for whatever
> reason? Will the document be resent by the leader to the replica shards to
> index the document after some time, or how is this scenario addressed?
>
> Also, given the above context, when I set the value of the min_rf parameter
> to say 2, does that mean the calling application will be informed that the
> indexing failed?
>



-- 
Regards,
Shalin Shekhar Mangar.


Re: Solr Heap usage

2018-05-02 Thread Greenhorn Techie
Thanks Shawn for the inputs, which will definitely help us to scale our
cluster better.

Regards


On 2 May 2018 at 18:15:12, Shawn Heisey (apa...@elyograg.org) wrote:

On 5/1/2018 5:33 PM, Greenhorn Techie wrote:
> Wondering what the considerations are to arrive at an optimal heap size
> for the Solr JVM? Though I did discuss this on IRC, I am still unclear on
> how Solr uses the JVM heap space. Are there any pointers to understand
> this aspect better?

I'm one of the people you've been chatting with on IRC.

I also wrote the wiki page that Susheel has recommended to you.

> Given that Solr requires an optimally configured heap, so that the
> remaining unused memory can be used for the OS disk cache, I wonder how to
> best configure the Solr heap. Also, on IRC it was discussed that having
> 31GB of heap is better than having 32GB due to Java’s internal usage of
> heap. Can anyone guide further on heap configuration please?

With the index size you mentioned on IRC, it's very difficult to project
how much heap you're going to need. Actually setting up a system,
putting data on it, and firing real queries at it may be the only way to
be sure.

The only concrete advice I can give you with the information available
is this: Install as much memory as you can. It is extremely unlikely
that you would ever have too much memory when you're dealing with
terabyte-scale indexes.

Heavy indexing (which you have mentioned as a requirement in another
thread) will tend to require a larger heap.

Thanks,
Shawn


Re: Indexing throughput

2018-05-02 Thread Greenhorn Techie
Thanks Walter and Erick for the valuable suggestions. We shall try out
various values for shards and as well other tuning metrics I discussed in
various threads earlier.

Kind Regards


On 2 May 2018 at 18:24:31, Erick Erickson (erickerick...@gmail.com) wrote:

I've seen 1.5 M docs/second. Basically the indexing throughput is gated
by two things:
1> the number of shards. Indexing throughput essentially scales up
reasonably linearly with the number of shards.
2> the indexing program that pushes data to Solr. Before thinking Solr
is the bottleneck, check how fast your ETL process is pushing docs.

This pre-supposes using SolrJ and CloudSolrClient for the final push
to Solr. This pre-buckets the updates and sends the updates for each
shard to the shard leader, thus reducing the amount of work Solr has
to do. If you use SolrJ, you can easily do <2> above by just
commenting out the single call that pushes the docs to Solr in your
program.

Speaking of which, it's definitely best to batch the updates, see:
https://lucidworks.com/2015/10/05/really-batch-updates-solr-2/
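
For illustration, a minimal batched-update sketch over plain HTTP, assuming
the JSON update handler (the collection name and fields are hypothetical);
with SolrJ you would instead pass a list of documents to a single add() call:

    # send many documents per request instead of one at a time
    curl -X POST "http://localhost:8983/solr/mycoll/update?commitWithin=60000" \
      -H 'Content-Type: application/json' \
      -d '[{"id":"1","category":"a"},{"id":"2","category":"b"}]'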

Best,
Erick

On Wed, May 2, 2018 at 10:07 AM, Walter Underwood <wun...@wunderwood.org>
wrote:
> We have a similar sized cluster, 32 nodes with 36 processors and 60 GB RAM
> each (EC2 C4.8xlarge). The collection is 24 million documents with four
> shards. The cluster is Solr 6.6.2. All storage is SSD EBS.
>
> We built a simple batch loader in Java. We get about one million documents
> per minute with 64 threads. We do not use the cloud-smart SolrJ client. We
> just send all the batches to the load balancer and let Solr sort it out.
>
> You are looking for 3 million documents per minute. You will just have to
> test that.
>
> I haven’t tested it, but indexing should speed up linearly with the number
> of shards, because those are indexing in parallel.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/ (my blog)
>
>> On May 2, 2018, at 9:58 AM, Greenhorn Techie <greenhorntec...@gmail.com> wrote:
>>
>> Hi,
>>
>> The current hardware profile for our production cluster is 20 nodes, each
>> with 24 cores and 256GB memory. The data being indexed is very structured
>> in nature, with about 30 columns or so, out of which half are categorical
>> with a defined list of values. The expected peak indexing throughput is
>> about *50,000* documents per second (expected to be done at off-peak hours
>> so that search requests will be minimal during this time) and the average
>> throughput around *10,000* documents per second (normal business hours).
>>
>> Given the hardware profile, is it realistic and practical to achieve the
>> desired throughput? What factors affect the performance of indexing apart
>> from the above hardware characteristics? I understand that it's very
>> difficult to provide any guidance unless a prototype is done. But I wonder
>> what considerations and dependencies we need to be aware of and whether
>> our throughput expectations are realistic or not.
>>
>> Thanks
>


Indexing throughput

2018-05-02 Thread Greenhorn Techie
Hi,

The current hardware profile for our production cluster is 20 nodes, each
with 24 cores and 256GB memory. The data being indexed is very structured
in nature, with about 30 columns or so, out of which half are categorical
with a defined list of values. The expected peak indexing throughput is
about *50,000* documents per second (expected to be done at off-peak hours
so that search requests will be minimal during this time) and the average
throughput around *10,000* documents per second (normal business hours).

Given the hardware profile, is it realistic and practical to achieve the
desired throughput? What factors affect the performance of indexing apart
from the above hardware characteristics? I understand that it's very
difficult to provide any guidance unless a prototype is done. But I wonder
what considerations and dependencies we need to be aware of and whether our
throughput expectations are realistic or not.

Thanks


SolrCloud Sharding

2018-05-02 Thread Greenhorn Techie
Hi,

I have few questions on sharding in a SolrCloud setup:

1. How do we know the optimal number of shards required for a SolrCloud
setup? What are the factors to consider when deciding on the value for the
*numShards* parameter?
2. In case over-sharding has been done, i.e. numShards has been set to a
very high value, is there a mechanism to merge multiple shards in a
SolrCloud setup?
3. If no such merge mechanism is available, is reindexing the only option
to set numShards to a new, lower value?

Thnx.


SolrCloud replicaition

2018-05-02 Thread Greenhorn Techie
Hi,

Good Morning!!

In the case of a SolrCloud setup with sharding and replication in place,
when a document is sent for indexing, what happens when only the shard
leader has indexed the document, but the replicas failed, for whatever
reason? Will the document be resent by the leader to the replica shards to
index the document after some time, or how is this scenario addressed?

Also, given the above context, when I set the value of the min_rf parameter
to say 2, does that mean the calling application will be informed that the
indexing failed?


Solr Heap usage

2018-05-01 Thread Greenhorn Techie
Hi,

Wondering what the considerations are to arrive at an optimal heap size for
the Solr JVM? Though I did discuss this on IRC, I am still unclear on how
Solr uses the JVM heap space. Are there any pointers to understand this
aspect better?

Given that Solr requires an optimally configured heap, so that the
remaining unused memory can be used for the OS disk cache, I wonder how to
best configure the Solr heap. Also, on IRC it was discussed that having 31GB
of heap is better than having 32GB due to Java’s internal usage of heap. Can
anyone guide further on heap configuration please?
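
For what it's worth, the 31GB vs 32GB point is about compressed ordinary
object pointers, which the JVM can only use for heaps under ~32GB; a minimal
sketch of where the heap is set, assuming the stock solr.in.sh:

    # keep the heap just under 32GB so compressed oops stay enabled
    SOLR_HEAP="31g"
    # equivalently, at startup: bin/solr start -m 31g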

Thanks


Query Regarding Solr Garbage Collection

2018-05-01 Thread Greenhorn Techie
Hi,

Following the https://wiki.apache.org/solr/SolrPerformanceFactors article,
I understand that Garbage Collection might be triggered due to a significant
increase in JVM heap usage unless a commit is performed. Given this
background, I am curious to understand the reasons / factors that
contribute to increased heap usage of the Solr JVM, and which would thus
force a Garbage Collection cycle.

Especially, what are the factors that contribute to heap usage increase
during indexing time and what factors contribute during search/query time?

Thanks


Re: SolrCloud Heterogenous Hardware setup

2018-05-01 Thread Greenhorn Techie
Thanks Erick. This information is very helpful. Will explore further on the
node placement rules within Collections API.

Many Thanks


On 1 May 2018 at 16:26:34, Erick Erickson (erickerick...@gmail.com) wrote:

"Is it possible to configure a collection such that the collection
data is only stored on few nodes in the SolrCloud setup?"

Yes. There are "node placement rules", but also you can create a
collection with a createNodeSet that specifies the nodes that the
replicas are placed on.

" If this is possible, at the end of each month, what is the approach
to be taken to “move” the latest collection from higher-spec hardware
machines to the lower-spec ones?"

There are a bunch of ways, in order of how long they've been around
(check your version). All of these are COLLECTIONS API calls.
- ADDREPLICA/DELETEREPLCIA
- MOVEREPLICA
- REPLACENODE
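
For reference, a hedged sketch of the createNodeSet approach mentioned above
and one of these move calls (collection, replica and node names are
hypothetical; MOVEREPLICA in particular depends on your version):

    # create the "hot" collection only on the two high-spec nodes
    curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=events_2018_05&numShards=4&replicationFactor=2&createNodeSet=host1:8983_solr,host2:8983_solr"

    # at month end, move a replica down to a lower-spec node
    curl "http://localhost:8983/solr/admin/collections?action=MOVEREPLICA&collection=events_2018_05&replica=core_node3&targetNode=host9:8983_solr"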

The other thing you may want to look at is that David Smiley has been
working on time-series support in Solr, but that's quite recent so it may
not be available in whatever version you're using. Nor do I know
enough details about it to know how (or if) it supports the
heterogeneous setup you're talking about. Check CHANGES.txt.

Best,
Erick

On Tue, May 1, 2018 at 7:59 AM, Greenhorn Techie
<greenhorntec...@gmail.com> wrote:
> Hi,
>
> We are building a SolrCloud setup, which will index time-series data.
> Being time-series data with write-once semantics, we are planning to have
> multiple collections, i.e. one collection per month. As per our use case,
> end users should be able to query across the last 12 months' worth of
> data, which means 12 collections (with one collection per month). To
> achieve this, we are planning to leverage Solr collection aliasing such
> that the search_alias collection will point to the 12 collections and
> indexing will always happen to the latest collection.
>
> As it's write-once kind of data, the question I have is whether it is
> possible to have two different hardware profiles within the SolrCloud
> cluster such that all the older collections (being read-only) will be
> stored on the lower hardware spec, while the latest collection (being
> write-heavy) will be stored only on the higher hardware profile machines.
>
> - Is it possible to configure a collection such that the collection data
> is only stored on a few nodes in the SolrCloud setup?
> - If this is possible, at the end of each month, what is the approach to
> be taken to “move” the latest collection from higher-spec hardware
> machines to the lower-spec ones?
>
> TIA.


SolrCloud Heterogenous Hardware setup

2018-05-01 Thread Greenhorn Techie
Hi,

We are building a SolrCloud setup, which will index time-series data. Being
time-series data with write-once semantics, we are planning to have
multiple collections, i.e. one collection per month. As per our use case,
end users should be able to query across the last 12 months' worth of data,
which means 12 collections (with one collection per month). To achieve
this, we are planning to leverage Solr collection aliasing such that the
search_alias collection will point to the 12 collections and indexing will
always happen to the latest collection.
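
For illustration, a sketch of the aliasing calls we have in mind (the
monthly collection names are hypothetical):

    # read alias spanning the monthly collections (12 in practice)
    curl "http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=search_alias&collections=events_2018_04,events_2018_05"

    # write alias repointed to the newest collection each month
    curl "http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=index_alias&collections=events_2018_05"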

As it's write-once kind of data, the question I have is whether it is
possible to have two different hardware profiles within the SolrCloud
cluster such that all the older collections (being read-only) will be
stored on the lower hardware spec, while the latest collection (being
write-heavy) will be stored only on the higher hardware profile machines.

   - Is it possible to configure a collection such that the collection data
   is only stored on a few nodes in the SolrCloud setup?
   - If this is possible, at the end of each month, what is the approach to
   be taken to “move” the latest collection from higher-spec hardware machines
   to the lower-spec ones?

TIA.


Re: Solr DR Replication

2017-12-07 Thread Greenhorn Techie
Any thoughts / help on this please.

Thanks in advance.

On Wed, 6 Dec 2017 at 16:21 Greenhorn Techie <greenhorntec...@gmail.com>
wrote:

> Hi,
>
> We are on Solr 5.5.2 and wondering what is the best mechanism for
> replicating Solr indexes from a Disaster Recovery perspective. As I
> understand, CDCR is only available from Solr 6 onwards. However, I couldn't
> find much content around index replication management for older versions.
> Wondering if any such documented solution is available.
>
> From a replication perspective, apart from SolrCloud collection data, what
> other information needs to be copied over from the source cluster to the
> target cluster? Should we copy the ZK data as well for the collection?
>
> TIA
>


Time-Series data indexing into Solr

2017-12-07 Thread Greenhorn Techie
Hi,

Is there any recommended approach to index and search time-series data in
Solr?

Thanks in Advance.


Solr DR Replication

2017-12-06 Thread Greenhorn Techie
Hi,

We are on Solr 5.5.2 and wondering what is the best mechanism for
replicating Solr indexes from a Disaster Recovery perspective. As I
understand, CDCR is only available from Solr 6 onwards. However, I couldn't
find much content around index replication management for older versions.
Wondering if any such documented solution is available.

From a replication perspective, apart from SolrCloud collection data, what
other information needs to be copied over from the source cluster to the
target cluster? Should we copy the ZK data as well for the collection?

TIA


Re: Solr on HDFS vs local storage - Benchmarking

2017-11-22 Thread Greenhorn Techie
Hendrik,

Thanks for your response.

Regarding "But this seems to greatly depend on how your setup looks like
and what actions you perform." May I know what are the factors influence
and what considerations are to be taken in relation to this?

Thanks

On Wed, 22 Nov 2017 at 14:16 Hendrik Haddorp <hendrik.hadd...@gmx.net>
wrote:

> We did some testing and the performance was strangely even better with
> HDFS than with the local file system. But this seems to greatly
> depend on how your setup looks and what actions you perform. We now
> have a pattern with lots of small updates and commits and that seems to be
> quite a bit slower. We are about to do performance testing on that now.
>
> The reason we switched to HDFS was largely connected to us using Docker
> and Marathon/Mesos. With HDFS the data is in a shared file system and
> thus it is possible to move a replica to a different instance on a
> different host.
>
> regards,
> Hendrik
>
> On 22.11.2017 14:59, Greenhorn Techie wrote:
> > Hi,
> >
> > Good Afternoon!!
> >
> > While the discussion around issues related to "Solr on HDFS" is live, I
> > would like to understand if anyone has done any performance benchmarking
> > for both Solr indexing and search between HDFS vs local file system.
> >
> > Also, from experience, what would the community folks suggest? Solr on
> > local file system or Solr on HDFS? Has anyone done a comparative study of
> > these choices?
> >
> > Thanks
> >
>
>


Solr on HDFS vs local storage - Benchmarking

2017-11-22 Thread Greenhorn Techie
Hi,

Good Afternoon!!

While the discussion around issues related to "Solr on HDFS" is live, I
would like to understand if anyone has done any performance benchmarking
for both Solr indexing and search between HDFS vs local file system.

Also, from experience, what would the community folks suggest? Solr on
local file system or Solr on HDFS? Has anyone done a comparative study of
these choices?

Thanks


Solr / HDPSearch related

2017-11-10 Thread Greenhorn Techie
Hi,

We have an HDP production cluster and are now planning to build a search
solution for some of our business requirements. In this regard, I have the
following questions. Can you please answer them with respect to Solr?

   - As I understand, it is more performant to set up SolrCloud to use
   local storage instead of HDFS for storing the indexes. If so, what are the
   use cases where SolrCloud would store its index in HDFS?
   - Also, if the indexes are stored in HDFS, will it be possible to update
   the documents stored in Solr in that case?
   - Will HDP Search be supported as part of the HDP support license itself,
   or does it need an additional license?
   - If SolrCloud is configured to use local storage, can it still be
   managed through Ambari? What aspects of SolrCloud might not be available
   through Ambari? Monitoring?

Just to provide more context, our data to be indexed is not in HDP at the
moment and would come from external sources.

Thanks


Solr Capacity Planning

2017-06-17 Thread Greenhorn Techie
Hi,

We are planning to set up a SolrCloud cluster for building a search
application on huge volumes of data points (~hundreds of billions of Solr
documents). I would like to understand if there is any recommendation on how
to size the infrastructure and hardware requirements for Solr clusters.
Also, what are the best practices to consider during this setup?

Thanks