Hendrik Haddorp commented on SOLR-6305:

[~elyograg] When you store a file in HDFS it ends up being stored in blocks, and 
these blocks get replicated to multiple nodes for increased safety. You can 
configure a default block replication factor, but you can also create files with 
a specific replication factor. The problem in Solr is that some parts take the 
default replication factor as it is defined on the HDFS name node, while others 
fall back to Solr's own default of 3 unless you tell Solr that you have a local 
HDFS configuration (using solr.hdfs.confdir). So when you set the default HDFS 
replication factor (done on the name node) to 1, Solr still ends up creating 
files that want to have a replication factor of 3. If you are using a small 
HDFS test setup that has only one node (data node, to be exact), your blocks 
are still created, but they are under-replicated.

When you tell SolrCloud to use a replicationFactor of 3, Solr creates 3 copies 
of the collection files in HDFS, just as it does in the local case. So yes, in 
your case one could say the data exists 9 times. One could also view Solr on 
HDFS like Solr on a shared RAID filesystem. With RAID, however, all files are 
replicated the same way, while in HDFS the replication factor of individual 
files can differ and can be changed dynamically.
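The multiplication above can be sketched briefly; this is a hypothetical illustration of the arithmetic described in the comment, not Solr code (the function name is mine):

```python
def total_copies(solr_replication_factor: int, hdfs_replication: int) -> int:
    # Each SolrCloud replica is a separate set of index files in HDFS,
    # and HDFS then replicates every block of each set independently.
    return solr_replication_factor * hdfs_replication

print(total_copies(3, 3))  # 9 -- the "data exists 9 times" case above
print(total_copies(3, 1))  # 3 -- if HDFS replication is lowered to 1
```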

In my opinion the only real problem is that Solr does not create all of its 
files in HDFS the same way: some code paths pick up the replication factor as 
defined on the HDFS name node while others don't.
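The inconsistency described above can be sketched as two code paths; this is a hypothetical model of the behavior (the function names are illustrative, not Solr's actual classes), assuming one path mirrors {{fileSystem.getServerDefaults(path).getReplication()}} while the other ignores the name node entirely:

```python
HARDCODED_DEFAULT = 3  # what Solr assumes without solr.hdfs.confdir

def replication_via_server_defaults(namenode_default: int) -> int:
    # Models the path that asks HDFS for its configured default,
    # like fileSystem.getServerDefaults(path).getReplication().
    return namenode_default

def replication_via_hardcoded_default(namenode_default: int) -> int:
    # Models the path that never consults the name node's setting.
    return HARDCODED_DEFAULT

# With dfs.replication = 1 on the name node, the two paths disagree:
print(replication_via_server_defaults(1))    # 1
print(replication_via_hardcoded_default(1))  # 3
```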

> Ability to set the replication factor for index files created by 
> HDFSDirectoryFactory
> -------------------------------------------------------------------------------------
>                 Key: SOLR-6305
>                 URL: https://issues.apache.org/jira/browse/SOLR-6305
>             Project: Solr
>          Issue Type: Improvement
>          Components: hdfs
>         Environment: hadoop-2.2.0
>            Reporter: Timothy Potter
>            Priority: Major
>         Attachments: 
> 0001-OIQ-23224-SOLR-6305-Fixed-SOLR-6305-by-reading-the-r.patch
>
> HdfsFileWriter doesn't allow us to create files in HDFS with a different 
> replication factor than the configured DFS default because it uses:     
> {{FsServerDefaults fsDefaults = fileSystem.getServerDefaults(path);}}
> Since we have two forms of replication going on when using 
> HDFSDirectoryFactory, it would be nice to be able to set the HDFS replication 
> factor for the Solr directories to a lower value than the default. I realize 
> this might reduce the chance of data locality but since Solr cores each have 
> their own path in HDFS, we should give operators the option to reduce it.
> My original thinking was to just use Hadoop setrep to customize the 
> replication factor, but that's a one-time shot and doesn't affect new files 
> created. For instance, I did:
> {{hadoop fs -setrep -R 1 solr49/coll1}}
> My default dfs replication is set to 3 ^^ I'm setting it to 1 just as an 
> example
> Then added some more docs to the coll1 and did:
> {{hadoop fs -stat %r solr49/hdfs1/core_node1/data/index/segments_3}}
> 3 <-- should be 1
> So it looks like new files don't inherit the repfact from their parent 
> directory.
> Not sure if we need to go as far as allowing different replication factor per 
> collection but that should be considered if possible.
> I looked at the Hadoop 2.2.0 code to see if there was a way to work through 
> this using the Configuration object but nothing jumped out at me ... and the 
> implementation for getServerDefaults(path) is just:
>   public FsServerDefaults getServerDefaults(Path p) throws IOException {
>     return getServerDefaults();
>   }
> Path is ignored ;-)

This message was sent by Atlassian JIRA

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org