[jira] [Updated] (LUCENE-6966) Contribution: Codec for index-level encryption

2016-11-15 Thread Renaud Delbru (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Renaud Delbru updated LUCENE-6966:
--
Attachment: Encryption Codec Documentation.pdf

Initial technical documentation.

> Contribution: Codec for index-level encryption
> --
>
> Key: LUCENE-6966
> URL: https://issues.apache.org/jira/browse/LUCENE-6966
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/other
>    Reporter: Renaud Delbru
>  Labels: codec, contrib
> Attachments: Encryption Codec Documentation.pdf, LUCENE-6966-1.patch, 
> LUCENE-6966-2-docvalues.patch, LUCENE-6966-2.patch
>
>
> We would like to contribute a codec, developed as part of an engagement with a 
> customer, that enables the encryption of sensitive data in the index. We think 
> this could be of interest to the community.
> Below is a description of the project.
> h1. Introduction
> In comparison with approaches where all data is encrypted (e.g., file system 
> encryption, index output / directory encryption), encryption at the codec level 
> enables more fine-grained control over which blocks of data are encrypted. This 
> is more efficient since less data has to be encrypted. It also gives more 
> flexibility, such as the ability to select which fields to encrypt.
> Some of the requirements for this project were:
> * The performance impact of the encryption should be reasonable.
> * The user can choose which fields to encrypt.
> * Key management: During the life cycle of the index, the user can provide a 
> new version of their encryption key. Multiple key versions should co-exist in 
> one index.
> h1. What is supported?
> - Block tree terms index and dictionary
> - Compressed stored fields format
> - Compressed term vectors format
> - Doc values format (prototype based on an encrypted index output) - this 
> will be submitted as a separate patch
> - Index upgrader: command to upgrade all the index segments with the latest 
> key version available.
> h1. How is it implemented?
> h2. Key Management
> One index segment is encrypted with a single key version. An index can have 
> multiple segments, each one encrypted using a different key version. The key 
> version for a segment is stored in the segment info.
> The provided codec is abstract, and a subclass is responsible for providing an 
> implementation of the cipher factory. The cipher factory is responsible for 
> creating a cipher instance for a given key version.
> h2. Encryption Model
> The encryption model is based on AES/CBC with padding. The initialisation 
> vector (IV) is reused for performance reasons, but only on a per-format and 
> per-segment basis.
> While IV reuse is usually considered bad practice, the CBC mode is somewhat 
> resilient to IV reuse. The only "leak" of information this could lead to is 
> the ability to tell that two encrypted blocks of data start with the same 
> prefix. However, it is unlikely that two data blocks in an index segment will 
> start with the same data:
> - Stored Fields Format: Each encrypted data block is a compressed block 
> (~4kb) of one or more documents. It is unlikely that two compressed blocks 
> start with the same data prefix.
> - Term Vectors: Each encrypted data block is a compressed block (~4kb) of 
> terms and payloads from one or more documents. It is unlikely that two 
> compressed blocks start with the same data prefix.
> - Term Dictionary Index: The term dictionary index is encoded and encrypted 
> in a single data block.
> - Term Dictionary Data: Each data block of the term dictionary encodes a set 
> of suffixes. It is unlikely to have two dictionary data blocks sharing the 
> same prefix within the same segment.
> - DocValues: A DocValues file is composed of multiple encrypted data 
> blocks. It is unlikely to have two data blocks sharing the same prefix within 
> the same segment (each one encodes a list of values associated with a 
> field).
> To the best of our knowledge, this model should be safe. However, it would be 
> good if someone with security expertise in the community could review and 
> validate it. 
> h1. Performance
> We report here a performance benchmark we did on an early prototype based on 
> Lucene 4.x. The benchmark was performed on the Wikipedia dataset where all 
> the fields (id, title, body, date) were encrypted. Only the block tree terms 
> and compressed stored fields format were tested at that time. 
> h2. Indexing
> The indexing throughput slightly decreased and is roughly 15% less.

[jira] [Commented] (LUCENE-6966) Contribution: Codec for index-level encryption

2016-11-15 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15667213#comment-15667213
 ] 

Renaud Delbru commented on LUCENE-6966:
---

Is there still interest from the community in considering this patch as a 
contribution? Even if there are limitations, and it will therefore not cover 
all possible scenarios, we think this provides an initial set of core features 
and a good starting point for future work. We have received multiple personal 
requests for this patch, which shows there is real interest in such a 
feature. I am also attaching initial technical documentation that explains 
how to use the codec and clarifies its current known limitations.


[jira] [Commented] (LUCENE-6966) Contribution: Codec for index-level encryption

2016-05-16 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284788#comment-15284788
 ] 

Renaud Delbru commented on LUCENE-6966:
---

I think the latest patch is ready for commit. Any objections?


[jira] [Updated] (LUCENE-6966) Contribution: Codec for index-level encryption

2016-05-06 Thread Renaud Delbru (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Renaud Delbru updated LUCENE-6966:
--
Attachment: LUCENE-6966-2-docvalues.patch

Here is a separate patch (to apply on top of LUCENE-6966-2) for the doc values 
format. It is a prototype based on an encrypted index input/output. The 
encrypted index output writes encrypted data blocks of fixed size. Each data 
block has its own initialization vector.
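As a rough illustration of this scheme, here is a minimal sketch (not code from the patch; the class and method names are invented, and the fixed-size framing and actual index I/O are omitted). Each block is encrypted with AES/CBC under a fresh IV, and the IV is written in clear in the block header so the read side can initialise the decryption cipher:

```java
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical sketch: each data block is encrypted with its own IV,
// and the IV is stored in clear in the block header.
public class EncryptedBlock {
  private static final int IV_LENGTH = 16; // AES block size
  private static final SecureRandom RANDOM = new SecureRandom();

  // Returns IV || ciphertext for one data block.
  public static byte[] encryptBlock(byte[] key, byte[] plainBlock) throws Exception {
    byte[] iv = new byte[IV_LENGTH];
    RANDOM.nextBytes(iv); // fresh IV per block
    Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
    cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
    byte[] encrypted = cipher.doFinal(plainBlock);
    byte[] block = new byte[IV_LENGTH + encrypted.length];
    System.arraycopy(iv, 0, block, 0, IV_LENGTH);
    System.arraycopy(encrypted, 0, block, IV_LENGTH, encrypted.length);
    return block;
  }

  // Reads the IV back from the block header and decrypts the payload.
  public static byte[] decryptBlock(byte[] key, byte[] block) throws Exception {
    IvParameterSpec iv = new IvParameterSpec(block, 0, IV_LENGTH);
    Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
    cipher.init(Cipher.DECRYPT_MODE, new SecretKeySpec(key, "AES"), iv);
    return cipher.doFinal(block, IV_LENGTH, block.length - IV_LENGTH);
  }

  public static void main(String[] args) throws Exception {
    byte[] key = new byte[16]; // demo key only; real keys come from key management
    byte[] data = "some doc values block".getBytes(StandardCharsets.UTF_8);
    byte[] block = encryptBlock(key, data);
    System.out.println(Arrays.equals(decryptBlock(key, block), data)); // prints: true
  }
}
```

Because each block carries its own IV, blocks can be decrypted independently, which matches random access into a doc values file.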


[jira] [Commented] (LUCENE-6966) Contribution: Codec for index-level encryption

2016-04-28 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15262520#comment-15262520
 ] 

Renaud Delbru commented on LUCENE-6966:
---

Hi [~joel.bernstein],

{quote}
1) With the latest patch do you feel the major concerns have been addressed.
{quote}

Yes, the latest patch does not reuse IVs anymore, but instead uses a different IV 
for each data block. It also introduces an API so that one can control how IVs 
are generated and how the cipher is instantiated.
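Such an API could be sketched as follows (hypothetical code with invented names, not the actual classes from the patch): a factory that controls both cipher instantiation and IV generation, resolving key material from a key version.

```java
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Map;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical pluggable factory: subclasses decide how ciphers are
// instantiated and how IVs are generated.
interface CipherFactory {
  Cipher newCipher(int mode, int keyVersion, byte[] iv) throws GeneralSecurityException;
  byte[] newIv();
}

// Example implementation using AES/CBC with a random 16-byte IV and a
// simple key-version-to-key map.
class AesCbcCipherFactory implements CipherFactory {
  private final Map<Integer, byte[]> keysByVersion; // key version -> raw AES key
  private final SecureRandom random = new SecureRandom();

  AesCbcCipherFactory(Map<Integer, byte[]> keysByVersion) {
    this.keysByVersion = keysByVersion;
  }

  @Override
  public Cipher newCipher(int mode, int keyVersion, byte[] iv) throws GeneralSecurityException {
    Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
    cipher.init(mode, new SecretKeySpec(keysByVersion.get(keyVersion), "AES"),
        new IvParameterSpec(iv));
    return cipher;
  }

  @Override
  public byte[] newIv() {
    byte[] iv = new byte[16]; // AES block size
    random.nextBytes(iv);
    return iv;
  }
}
```

A format implementation would call `newIv()` once per data block and store the IV alongside the encrypted payload.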

{quote}
2) From my initial reading of the patch it seemed like everything in the patch 
was pluggable. Does this need to be committed to be usable? Or can it be hosted 
on another project?

3) Because it's such a large patch and codecs change over time, does it present 
a burden to maintain with the core Lucene project? Along these lines is it more 
appropriate from a maintenance standpoint to be maintained by people who are 
really motivated to have this feature. Alfresco engineers would likely 
participate in an outside project if one existed.
{quote}

The patch follows the standard rules of Lucene codecs, so yes, it is fully 
pluggable. As with other codecs, the burden of maintaining it should be 
low. It is a set of Lucene *Format classes that are loosely coupled with 
other parts of the Lucene code. It will likely require maintenance only when 
Lucene's high-level Codec and Format APIs change.

The patch is large because we had to copy some of the original Lucene 
*Format classes, as those classes were final and not extensible. Updating them 
with the latest improvements made to the original classes might require a bit 
more effort, but in my experience it has so far been straightforward.


[jira] [Commented] (SOLR-6465) CDCR: fall back to whole-index replication when tlogs are insufficient

2016-04-20 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-6465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249622#comment-15249622
 ] 

Renaud Delbru commented on SOLR-6465:
-

It would be great indeed to be able to simplify the code as you proposed, if we 
can rely on a bootstrap method. Below are some observations that might be 
useful.

One of the concerns I have relates to the default size limit of the update 
logs. By default, the update log keeps 10 tlog files or 100 records. This will 
likely be too small a buffer for cdcr, and there is a risk of a continuous 
cycle of bootstrapping replication. One could increase the values of 
"numRecordsToKeep" and "maxNumLogsToKeep" in solrconfig to accommodate the cdcr 
requirements, but these are additional parameters that the user needs to take 
into consideration, and they make configuration more complex. I am wondering if 
we could find a more appropriate default value for cdcr?
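For reference, these limits are set on the update log in solrconfig.xml; a configuration raising them might look like this (the values here are illustrative, not a recommendation):

{code:xml}
<updateLog class="solr.CdcrUpdateLog">
  <str name="dir">${solr.ulog.dir:}</str>
  <!-- keep more records / tlog files so cdcr has enough buffer -->
  <int name="numRecordsToKeep">1000</int>
  <int name="maxNumLogsToKeep">100</int>
</updateLog>
{code}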

The issue with increasing these limits in the original update log, compared to 
the cdcr update log, is that the original update log will not clean up old tlog 
files that are no longer necessary for replication (it will keep all tlogs up 
to the limit). For example, if one increases maxNumLogsToKeep to 100 and 
numRecordsToKeep to 1000, then the node will always keep 100 tlog files or 1000 
records in the update logs, even if all of them have been replicated to the 
target clusters. This might cause unexpected issues related to disk space or 
performance.

The CdcrUpdateLog managed this by allowing a variable-size update log that 
removes a tlog once it has been fully replicated. But that brings us back to 
where we were, with all the added management around the cdcr update log, i.e., 
buffer, lastprocessedversion, CdcrLogSynchronizer, etc.

h4. Cdcr Buffer

If we get rid of the cdcr update log logic, then we can also get rid of the 
Cdcr Buffer (buffer state, buffer commands, etc.).

h4. CdcrUpdateLog

I am not sure we can get rid of the CdcrUpdateLog entirely. It includes 
logic, such as the sub-reader and forward seek, that is necessary for sending 
batch updates. Maybe this logic can be moved into the UpdateLog?

h4. CdcrLogSynchronizer

I think it is safe to get rid of this. If a leader goes down while a cdcr 
reader is forwarding updates, the new leader will likely miss the tlogs 
necessary to resume where the cdcr reader stopped. But in that case, it can 
fall back to bootstrapping.

h4. Tlog Replication

If the tlogs are not replicated during a bootstrap, then the tlogs on the 
target will not be in sync. Could this cause any issues on the target cluster, 
e.g., in case of a recovery?
If the target is itself configured as a source (i.e., daisy chain), this will 
probably cause issues. The update logs will likely contain gaps, and it will be 
very difficult for the source to know that there is a gap. Therefore, it might 
forward incomplete updates. But this might be a feature we could drop, as 
suggested in one of your comments on the cwiki.

> CDCR: fall back to whole-index replication when tlogs are insufficient
> --
>
> Key: SOLR-6465
> URL: https://issues.apache.org/jira/browse/SOLR-6465
> Project: Solr
>  Issue Type: Sub-task
>Reporter: Yonik Seeley
> Attachments: SOLR-6465.patch, SOLR-6465.patch
>
>
> When the peer-shard doesn't have transaction logs to forward all the needed 
> updates to bring a peer up to date, we need to fall back to normal 
> replication.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-6465) CDCR: fall back to whole-index replication when tlogs are insufficient

2016-04-19 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-6465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15247855#comment-15247855
 ] 

Renaud Delbru commented on SOLR-6465:
-

[~shalinmangar], would the goal be to rely solely on the bootstrapping method 
to replicate indexes, instead of on the updates forwarding method (i.e., cdcr 
update logs)? Or would it be a combination of bootstrapping and updates 
forwarding (based on the original update log, not the cdcr one)?




[jira] [Updated] (LUCENE-6966) Contribution: Codec for index-level encryption

2016-04-05 Thread Renaud Delbru (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Renaud Delbru updated LUCENE-6966:
--
Attachment: LUCENE-6966-2.patch

This patch includes changes so that every encrypted data block uses a new IV. 
The IV is encoded in the header of the data block. The CipherFactory has been 
extended so that users can decide how to instantiate a cipher and how to 
generate new IVs.

The performance impact of storing and using a unique IV per block is minimal. 
The results of the benchmark below (performed on the full Wikipedia dataset) 
show that there is no significant difference in QPS:

{noformat}
Task                 QPS 6966-before (StdDev)   QPS 6966-after (StdDev)            Pct diff
Respell                    20.56 (11.2%)             19.18  (7.9%)      -6.7% ( -23% -   13%)
Fuzzy2                     33.98 (11.7%)             32.76 (11.0%)      -3.6% ( -23% -   21%)
Fuzzy1                     31.13 (11.2%)             30.05  (8.2%)      -3.5% ( -20% -   17%)
PKLookup                  125.62 (13.0%)            121.38  (8.8%)      -3.4% ( -22% -   21%)
Wildcard                   35.10 (11.7%)             34.36  (8.2%)      -2.1% ( -19% -   20%)
OrNotHighMed               25.90 (11.4%)             25.86 (10.5%)      -0.2% ( -19% -   24%)
OrNotHighHigh              15.26 (12.1%)             15.28 (10.8%)       0.2% ( -20% -   26%)
OrHighNotHigh               9.80 (12.4%)              9.82 (12.0%)       0.2% ( -21% -   28%)
OrHighNotMed               13.01 (13.4%)             13.06 (13.0%)       0.4% ( -22% -   30%)
LowTerm                   252.64 (12.5%)            253.90  (8.7%)       0.5% ( -18% -   24%)
OrHighNotLow               35.63 (13.5%)             35.83 (13.4%)       0.6% ( -23% -   31%)
Prefix3                    21.70 (13.3%)             21.86  (9.7%)       0.7% ( -19% -   27%)
MedTerm                    83.04 (11.7%)             83.73  (8.0%)       0.8% ( -16% -   23%)
AndHighHigh                15.41 (10.6%)             15.61  (7.9%)       1.3% ( -15% -   22%)
LowSloppyPhrase            68.89 (12.5%)             69.90  (9.0%)       1.5% ( -17% -   26%)
AndHighLow                294.02 (11.6%)            299.04  (8.3%)       1.7% ( -16% -   24%)
OrHighMed                  10.92 (14.4%)             11.13 (10.8%)       1.9% ( -20% -   31%)
OrHighHigh                  9.45 (14.6%)              9.63 (10.9%)       1.9% ( -20% -   32%)
MedSpanNear                69.01 (11.9%)             70.39  (8.4%)       2.0% ( -16% -   25%)
AndHighMed                 45.16 (12.4%)             46.17  (9.1%)       2.2% ( -17% -   27%)
HighTerm                   16.61 (13.3%)             16.99  (9.5%)       2.3% ( -18% -   28%)
LowPhrase                   3.03 (11.1%)              3.10  (9.2%)       2.3% ( -16% -   25%)
HighPhrase                 11.82 (13.0%)             12.10  (9.6%)       2.4% ( -17% -   28%)
MedPhrase                   7.49 (12.1%)              7.67  (9.1%)       2.4% ( -16% -   26%)
OrNotHighLow              424.80 (11.1%)            434.97  (8.2%)       2.4% ( -15% -   24%)
OrHighLow                  25.08 (12.0%)             25.70 (11.7%)       2.5% ( -18% -   29%)
HighSloppyPhrase            4.01 (13.7%)              4.11  (9.7%)       2.5% ( -18% -   30%)
MedSloppyPhrase             6.61 (12.9%)              6.78  (9.2%)       2.5% ( -17% -   28%)
LowSpanNear                15.52 (11.8%)             15.91  (8.6%)       2.5% ( -16% -   26%)
IntNRQ                      3.76 (16.4%)              3.86 (13.1%)       2.7% ( -23% -   38%)
HighSpanNear                4.40 (12.8%)              4.52  (9.1%)       2.8% ( -16% -   28%)
{noformat}

I took the opportunity to run another benchmark comparing this patch against 
Lucene's master. We can see that queries on low-frequency terms (probably 
because the dictionary lookup becomes more costly than the reading of the 
postings list) and queries that need to scan a large portion of the dictionary 
are the most impacted.

{noformat}
Task                 QPS master (StdDev)   QPS 6966 (StdDev)             Pct diff
Fuzzy1                  55.08 (15.5%)         35.89  (8.2%)   -34.8% ( -50% -  -13%)
Respell                 39.31 (16.9%)         28.47  (8.2%)   -27.6% ( -45% -   -3%)
Fuzzy2                  35.33 (16.8%)         28.21  (8.8%)   -20.1% ( -39% -    6%)
Wildcard                11.13 (18.9%)          9.95  (7.9%)   -10.6% ( -31% -   19%)
AndHighLow             304.79 (17.7%)        277.30 (10.4%)    -9.0% ( -31% -   23%)
OrNotHighLow           240.56 (16.8%)        226.64 (10.2%)    -5.8% ( -28% -   25%)
PKLookup               129.54 (20.1%)        122.47  (8.3%)    -5.5% ( -28% -   28%)
{noformat}

[jira] [Updated] (LUCENE-6966) Contribution: Codec for index-level encryption

2016-03-24 Thread Renaud Delbru (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Renaud Delbru updated LUCENE-6966:
--
Attachment: LUCENE-6966-1.patch

This patch contains the current state of the codec for index-level encryption. 
It is up to date with the latest version of the lucene-solr master branch. The 
patch does not yet include the ability for users to choose which cipher to 
use; I'll submit a new patch tackling this in the coming week.
The full Lucene test suite has been executed against this codec using the 
command:
{code}
ant -Dtests.codec=EncryptedLucene60 test
{code}
Only one test fails, TestSizeBoundedForceMerge#testByteSizeLimit, which is 
expected: this test is incompatible with the codec.

The doc values format (prototype based on an encrypted index output) is not 
included in this patch, and will be submitted as a separate patch in the 
coming days.


[jira] [Commented] (LUCENE-6966) Contribution: Codec for index-level encryption

2016-03-21 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15204549#comment-15204549
 ] 

Renaud Delbru commented on LUCENE-6966:
---

Karl, the patch will not include a ready-to-use FSDirectory implementation, but 
the doc values format is based on an encrypted index input and output 
implementation which can easily be reused in an implementation of FSDirectory.
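For illustration, the same layering can be sketched with plain java.io streams and the JDK's CipherOutputStream/CipherInputStream. This is only an analogue of the idea under a hypothetical class name, not the patch's actual IndexOutput/IndexInput implementation: every byte written through the wrapper is transparently encrypted, and reads are transparently decrypted.

```java
import javax.crypto.Cipher;
import javax.crypto.CipherInputStream;
import javax.crypto.CipherOutputStream;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import java.io.InputStream;
import java.io.OutputStream;

// Illustrative analogue of an encrypted index output/input built on
// plain java.io streams (the patch wraps Lucene's IndexOutput/IndexInput
// instead, but the layering is the same).
public class EncryptedStreams {

    /** Wraps an output stream so all written bytes are AES/CBC-encrypted. */
    public static OutputStream encrypting(OutputStream out, SecretKeySpec key,
                                          IvParameterSpec iv) throws Exception {
        Cipher c = Cipher.getInstance("AES/CBC/PKCS5Padding");
        c.init(Cipher.ENCRYPT_MODE, key, iv);
        return new CipherOutputStream(out, c);
    }

    /** Wraps an input stream so all read bytes are transparently decrypted. */
    public static InputStream decrypting(InputStream in, SecretKeySpec key,
                                         IvParameterSpec iv) throws Exception {
        Cipher c = Cipher.getInstance("AES/CBC/PKCS5Padding");
        c.init(Cipher.DECRYPT_MODE, key, c.getParameters() == null ? iv : iv);
        return new CipherInputStream(in, c);
    }
}
```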


[jira] [Commented] (LUCENE-6966) Contribution: Codec for index-level encryption

2016-03-19 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15197431#comment-15197431
 ] 

Renaud Delbru commented on LUCENE-6966:
---

Thanks for all of the feedback. Based on everyone's comments, it seems that 
different encryption algorithms might be better suited to different situations. 
Rather than implement a one-size-fits-all solution, it may be better not to 
enforce any single cipher, and instead leave users the flexibility to choose 
the cipher they find most appropriate.

If everyone is okay with this approach, I will update the code appropriately.


[jira] [Commented] (LUCENE-6966) Contribution: Codec for index-level encryption

2016-01-08 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15089007#comment-15089007
 ] 

Renaud Delbru commented on LUCENE-6966:
---

I agree with you that if we add encryption to Lucene, it should always be 
secure. That's why I opened up the discussion with the community, in order to 
review and agree on which approach to adopt.
With respect to IV reuse in CBC mode, a potential leak of information occurs 
when two messages share a common prefix, as it will reveal the presence and 
length of that prefix.
Now, if we look at each format separately and at what type of message is 
encrypted in each one, we can assess the risk:
- Term Dictionary Index: the entire term dictionary index in a segment is 
encrypted as one single message - the risk is null.
- Term Dictionary Data: each suffix-bytes blob is encrypted as one message - 
I would assume that the probability of two suffix-bytes blobs sharing the 
same prefix or being identical is pretty low, but I might be wrong.
- Stored Fields Format: each compressed doc chunk is encrypted as one message - 
a doc chunk can contain the exact same data (e.g., if multiple documents 
contain the exact same fields and values). This is more likely to happen, but 
it sounds like an edge case.
- Term Vectors: each compressed terms-and-payloads blob of a doc chunk is 
encrypted as one message - same issue as with the Stored Fields Format.

If the risk of reusing IVs in Stored Fields / Term Vectors is not acceptable, 
one solution is to add a randomly generated header to each compressed doc 
chunk that will serve as a unique IV. What do you think?
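That proposal can be sketched as follows (illustrative class and method names, not the patch's API): each chunk is encrypted under a freshly generated random IV, which is stored as a 16-byte header in front of the ciphertext so the reader can recover it.

```java
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import java.security.SecureRandom;

// Sketch of the random-header idea: a fresh IV per compressed doc chunk,
// prepended to the ciphertext, so identical chunks no longer encrypt to
// identical bytes. Names are hypothetical.
public class RandomHeaderChunkCipher {
    private static final int IV_LENGTH = 16; // AES block size
    private final SecretKeySpec key;
    private final SecureRandom random = new SecureRandom();

    public RandomHeaderChunkCipher(byte[] keyBytes) {
        this.key = new SecretKeySpec(keyBytes, "AES");
    }

    /** Encrypts a chunk under a fresh random IV, prepended to the output. */
    public byte[] encryptChunk(byte[] chunk) throws Exception {
        byte[] iv = new byte[IV_LENGTH];
        random.nextBytes(iv);
        Cipher c = Cipher.getInstance("AES/CBC/PKCS5Padding");
        c.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(iv));
        byte[] ct = c.doFinal(chunk);
        byte[] out = new byte[IV_LENGTH + ct.length];
        System.arraycopy(iv, 0, out, 0, IV_LENGTH);
        System.arraycopy(ct, 0, out, IV_LENGTH, ct.length);
        return out;
    }

    /** Reads the IV header back and decrypts the remaining bytes. */
    public byte[] decryptChunk(byte[] data) throws Exception {
        Cipher c = Cipher.getInstance("AES/CBC/PKCS5Padding");
        c.init(Cipher.DECRYPT_MODE, key, new IvParameterSpec(data, 0, IV_LENGTH));
        return c.doFinal(data, IV_LENGTH, data.length - IV_LENGTH);
    }
}
```

The storage cost is 16 bytes per chunk, which is negligible against ~4kb compressed blocks.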


[jira] [Commented] (LUCENE-6966) Contribution: Codec for index-level encryption

2016-01-07 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15087144#comment-15087144
 ] 

Renaud Delbru commented on LUCENE-6966:
---

Discussion copied from the following [dev 
thread|http://mail-archives.apache.org/mod_mbox/lucene-dev/201601.mbox/%3C568D2289.5080408@siren.solutions%3E]
 

{quote}
I would strongly recommend against "invent your own mode", and instead
using standardized schemes/modes (e.g. XTS).

Separate from that, I don't understand the reasoning to do it at the
codec level. seems quite a bit more messy and complicated than the
alternatives, such as block device level (e.g. dm-crypt), or
filesystem level (e.g. ext4 filesystem encryption), which have the
advantage of the filesystem cache actually working.
{quote}

[~rcmuir], 

Yes, you are right. This approach is more complex than plain fs level 
encryption, but it enables more fine-grained control over what is encrypted. 
With fs level encryption, for example, it would not be possible to choose 
which fields to encrypt; all the data is encrypted regardless of whether it is 
sensitive or not. In such a scenario the full posting lists will be encrypted, 
which is unnecessary, and you'll pay the cost of encrypting them.
It is true that if the filesystem caches unencrypted pages, then with a warm 
cache you will likely get better performance. However, this also means that 
most of the index data will reside in memory in unencrypted form; if the 
server is compromised, this makes life easier for the attacker. There is also 
the (small) issue of swap, which can end up containing a large portion of the 
index unencrypted. This can be solved by using an encrypted swap, but then the 
data is encrypted with a single key rather than a per-user key, and it adds 
complexity to the management of the system.
Highly sensitive installations can make the trade-off between performance and 
security. There are some applications for Solr that are not served by the other 
approaches.

This codec was developed in the context of a large multi-tenant architecture, 
where each user has their own index / collection. Each user has their own key, 
and can update it at any time.
While it seems it would be possible with ext4 to handle a per-user key (e.g., 
one key per directory), it makes the key and index management more complex 
(especially in SolrCloud), which is not adequate for some environments.
Also, fs level encryption does not allow the management of multiple key 
versions in one index: if the user changes their key, we have to re-encrypt 
the full directory, which is not acceptable performance-wise for some 
environments.

The codec level encryption approach is more adequate for some environments than 
the fs level encryption approach. Also, it is to be noted that this codec does 
not affect the rest of Lucene/Solr. Users will be able to choose which approach 
is more adequate for their environment. This gives more options to Lucene/Solr 
users.


Re: Contribution: Codec for index-level encryption

2016-01-07 Thread Renaud Delbru

Hi Robert,


P.S.: I have created the issue LUCENE-6966 and moved the discussion there, as 
it is simpler for external people to participate in the discussion.


Regards
--
Renaud Delbru

On 06/01/16 15:32, Robert Muir wrote:

I would strongly recommend against "invent your own mode", and instead
using standardized schemes/modes (e.g. XTS).

Separate from that, I don't understand the reasoning to do it at the
codec level. seems quite a bit more messy and complicated than the
alternatives, such as block device level (e.g. dm-crypt), or
filesystem level (e.g. ext4 filesystem encryption), which have the
advantage of the filesystem cache actually working.



[jira] [Created] (LUCENE-6966) Contribution: Codec for index-level encryption

2016-01-07 Thread Renaud Delbru (JIRA)
Renaud Delbru created LUCENE-6966:
-

 Summary: Contribution: Codec for index-level encryption
 Key: LUCENE-6966
 URL: https://issues.apache.org/jira/browse/LUCENE-6966
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/other
Reporter: Renaud Delbru


We would like to contribute a codec that enables the encryption of sensitive 
data in the index that has been developed as part of an engagement with a 
customer. We think that this could be of interest for the community.

Below is a description of the project.

h1. Introduction

In comparison with approaches where all data is encrypted (e.g., file system 
encryption, index output / directory encryption), encryption at the codec level 
enables more fine-grained control over which blocks of data are encrypted. This 
is more efficient since less data has to be encrypted, and it also gives more 
flexibility, such as the ability to select which fields to encrypt.

Some of the requirements for this project were:

* The performance impact of the encryption should be reasonable.
* The user can choose which field to encrypt.
* Key management: During the life cycle of the index, the user can provide a 
new version of their encryption key. Multiple key versions should be able to 
co-exist in one index.

h1. What is supported?

- Block tree terms index and dictionary
- Compressed stored fields format
- Compressed term vectors format
- Doc values format (prototype based on an encrypted index output) - this will 
be submitted as a separate patch
- Index upgrader: command to upgrade all the index segments with the latest key 
version available.

h1. How is it implemented?

h2. Key Management

One index segment is encrypted with a single key version. An index can have 
multiple segments, each one encrypted using a different key version. The key 
version for a segment is stored in the segment info.

The provided codec is abstract, and a subclass is responsible for providing an 
implementation of the cipher factory. The cipher factory is responsible for 
creating a cipher instance based on a given key version.
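As a rough sketch of this contract (illustrative names only, not the patch's actual interfaces), a concrete factory might map the key version recorded in the segment info to an initialised javax.crypto Cipher:

```java
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the cipher-factory split described above: the
// abstract codec asks the factory for a Cipher, given the key version
// stored in a segment's info.
public class KeyVersionCipherFactory {
    private final Map<Integer, byte[]> keysByVersion = new HashMap<>();
    private final byte[] iv; // per-format / per-segment IV, as in the model

    public KeyVersionCipherFactory(byte[] iv) {
        this.iv = iv;
    }

    /** Registers a key for a given version; old versions remain usable. */
    public void registerKey(int version, byte[] keyBytes) {
        keysByVersion.put(version, keyBytes);
    }

    /** Returns a cipher initialised for the requested key version and mode. */
    public Cipher create(int keyVersion, int opMode) throws Exception {
        byte[] keyBytes = keysByVersion.get(keyVersion);
        if (keyBytes == null) {
            throw new IllegalArgumentException("Unknown key version: " + keyVersion);
        }
        Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
        cipher.init(opMode, new SecretKeySpec(keyBytes, "AES"),
                    new IvParameterSpec(iv));
        return cipher;
    }
}
```

Keeping old key versions registered is what lets segments encrypted under different versions co-exist until the index upgrader re-encrypts them.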

h2. Encryption Model

The encryption model is based on AES/CBC with padding. The initialisation 
vector (IV) is reused for performance reasons, but only on a per-format and 
per-segment basis.

While IV reuse is usually considered a bad practice, the CBC mode is somewhat 
resilient to it. The only "leak" of information this could lead to is the 
ability to tell that two encrypted blocks of data start with the same prefix. 
However, it is unlikely that two data blocks in an index segment will start 
with the same data:

- Stored Fields Format: Each encrypted data block is a compressed block (~4kb) 
of one or more documents. It is unlikely that two compressed blocks start with 
the same data prefix.

- Term Vectors: Each encrypted data block is a compressed block (~4kb) of terms 
and payloads from one or more documents. It is unlikely that two compressed 
blocks start with the same data prefix.

- Term Dictionary Index: The term dictionary index is encoded and encrypted in 
one single data block.

- Term Dictionary Data: Each data block of the term dictionary encodes a set of 
suffixes. It is unlikely to have two dictionary data blocks sharing the same 
prefix within the same segment.

- DocValues: A DocValues file is composed of multiple encrypted data blocks. 
It is unlikely to have two data blocks sharing the same prefix within the same 
segment (each one encodes a list of values associated with a field).

To the best of our knowledge, this model should be safe. However, it would be 
good if someone with security expertise in the community could review and 
validate it. 
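The prefix-leak property described above can be demonstrated with the JDK's javax.crypto API. This is an illustrative sketch, not code from the patch: with a reused key and IV, two plaintexts that share their first 16-byte block produce an identical first ciphertext block, while any difference within that block changes it.

```java
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import java.util.Arrays;

// Illustrative sketch (not code from the patch): with AES/CBC and a reused
// key + IV, equal first plaintext blocks yield equal first ciphertext blocks.
class CbcPrefixLeak {

    // Encrypt with AES/CBC and PKCS5 padding under a fixed key and IV.
    static byte[] encrypt(byte[] key, byte[] iv, byte[] plain) throws Exception {
        Cipher c = Cipher.getInstance("AES/CBC/PKCS5Padding");
        c.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"),
               new IvParameterSpec(iv));
        return c.doFinal(plain);
    }

    // True when the first AES block (16 bytes) of both ciphertexts matches,
    // i.e. the two plaintexts shared their first 16 bytes.
    static boolean sameFirstBlock(byte[] a, byte[] b) {
        return Arrays.equals(Arrays.copyOf(a, 16), Arrays.copyOf(b, 16));
    }
}
```

This is exactly why the argument above rests on index data blocks being unlikely to share a block-aligned prefix within a segment.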

h1. Performance

We report here a performance benchmark we did on an early prototype based on 
Lucene 4.x. The benchmark was performed on the Wikipedia dataset where all the 
fields (id, title, body, date) were encrypted. Only the block tree terms and 
compressed stored fields formats were tested at that time.

h2. Indexing

The indexing throughput decreased slightly and is roughly 15% lower than with 
base Lucene.

The merge time increased by roughly 35%.

There was no significant difference in terms of index size.

h2. Query Throughput

With respect to query throughput, we observed no significant impact on the 
following queries: term query, boolean query, phrase query, numeric range 
query.

We observed the following performance impact for queries that need to scan a 
larger portion of the term dictionary:

- prefix query: decrease of ~25%
- wildcard query (e.g., “fu*r”): decrease of ~60%
- fuzzy query (distance 1): decrease of ~40%
- fuzzy query (distance 2): decrease of ~80%

We can see that the decrease in performance is relative to the size of the 
dictionary scan.

h2. Document Retrieval

We observed a decrease of document retrieval performance.

Contribution: Codec for index-level encryption

2016-01-06 Thread Renaud Delbru
- keep the order of fields, since non-encrypted and encrypted fields are 
stored in separate blocks.


- the current implementation of the cipher factory does not enforce the 
use of AES/CBC. We are planning to add this to the final version of the 
patch.


- the current implementation does not change the IV per segment. We are 
planning to add this to the final version of the patch.


- the current implementation of compressed stored fields decrypts a full 
compressed block even when only a small portion of it needs to be decompressed 
(high impact when storing very small documents). We are planning to add this 
optimisation to the final version of the patch. The overall document 
retrieval performance might improve with this optimisation.


The codec has been implemented as a contrib. Given that most of the 
classes were final, we had to copy most of the original code from the 
extended formats. At a later stage, we could think of opening some of 
these classes to extend them properly in order to reduce code 
duplication and simplify code maintenance.


--
Renaud Delbru


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-8263) Tlog replication could interfere with the replay of buffered updates

2015-12-07 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15044802#comment-15044802
 ] 

Renaud Delbru commented on SOLR-8263:
-

While patch SOLR-8263-trunk-3, which added the dedup logic for the buffered 
updates, seems straightforward, it introduced an issue which could lead to 
loss of documents.
The dedup logic used the version of the last operation from the tlog files 
transferred from the master as its starting point. However, these tlog files 
were not in sync with the index commit point; they were likely ahead of it 
(i.e., they contained operations that occurred after the index commit point). 
The starting point of the dedup logic was therefore ahead of the index commit 
point, and the logic dropped all operations that occurred between the index 
commit point and the time the tlog files were transferred from the master.
To solve this, we had to modify the ReplicationHandler to filter out tlog 
files that were not associated with a given commit point. To find the tlog 
files associated with an index commit point, we fetch the max version of the 
index commit using VersionInfo.getMaxVersionFromIndex and use this version 
number to discard tlog files. A tlog file name encodes the version of its 
starting operation (this was originally used for seeking more efficiently 
across multiple tlog files), and we use this starting version to discard 
tlogs that were created after the commit point (i.e., if starting version > 
max version).
The new patch committed by Erick includes this approach.
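The filtering step can be sketched as follows. The file-name format assumed here (a trailing ".&lt;startingVersion&gt;" suffix) is an illustration only; the real naming scheme lives in Solr's UpdateLog and ReplicationHandler code.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the tlog filtering described above. The file-name
// format (trailing ".<startingVersion>" suffix) is an assumption; the real
// naming scheme lives in Solr's UpdateLog.
class TlogFilter {

    // Parse the starting-operation version encoded in the tlog file name.
    static long startingVersion(String tlogName) {
        return Long.parseLong(tlogName.substring(tlogName.lastIndexOf('.') + 1));
    }

    // Keep only tlog files whose starting version does not exceed the max
    // version of the index commit point being replicated.
    static List<String> filterByCommitPoint(List<String> tlogNames,
                                            long maxCommitVersion) {
        List<String> kept = new ArrayList<>();
        for (String name : tlogNames) {
            if (startingVersion(name) <= maxCommitVersion) {
                kept.add(name);
            }
        }
        return kept;
    }
}
```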


> Tlog replication could interfere with the replay of buffered updates
> 
>
> Key: SOLR-8263
> URL: https://issues.apache.org/jira/browse/SOLR-8263
> Project: Solr
>  Issue Type: Sub-task
>    Reporter: Renaud Delbru
>Assignee: Erick Erickson
> Fix For: 5.5, 6.0
>
> Attachments: SOLR-6273-plus-8263-5x.patch, SOLR-8263-trunk-1.patch, 
> SOLR-8263-trunk-2.patch, SOLR-8263-trunk-3.patch, SOLR-8263.patch
>
>
> The current implementation of the tlog replication might interfere with the 
> replay of the buffered updates. The current tlog replication works as follows:
> 1) Fetch the tlog files from the master.
> 2) Reset the update log before switching the tlog directory.
> 3) Switch the tlog directory and re-initialise the update log with the new 
> directory.
> Currently there is no logic to keep "buffered updates" while resetting and 
> reinitializing the update log.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-8292) TransactionLog.next() does not honor contract and return null for EOF

2015-12-03 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037616#comment-15037616
 ] 

Renaud Delbru commented on SOLR-8292:
-

Perhaps related to SOLR-4116?

> TransactionLog.next() does not honor contract and return null for EOF
> -
>
> Key: SOLR-8292
> URL: https://issues.apache.org/jira/browse/SOLR-8292
> Project: Solr
>  Issue Type: Bug
>Reporter: Erick Erickson
>
> This came to light in CDCR testing, which stresses this code a lot, there's a 
> stack trace showing this line (641 trunk) throwing an EOF exception:
> o = codec.readVal(fis);
> At first I thought to just wrap reading fis in a try/catch and return null, 
> but looking at the code a bit more I'm not so sure, that seems like it'd mask 
> what looks at first glance like a bug in the logic.
> A few lines earlier (633-4) there's these lines:
> // shouldn't currently happen - header and first record are currently written 
> at the same time
> if (fis.position() >= fos.size()) {
> Why are we comparing the input file position against the size of the 
> output file? Maybe because the 'i' key is right next to the 'o' key? The 
> comment hints that it's checking for the ability to read the first record in 
> input stream along with the header. And perhaps there's a different issue 
> here because the expectation clearly is that the first record should be there 
> if the header is.
> So what's the right thing to do? Wrap in a try/catch and return null for EOF? 
> Change the test? Do both?
> I can take care of either, but wanted a clue whether the comparison of fis to 
> fos is intended.






[jira] [Updated] (SOLR-8263) Tlog replication could interfere with the replay of buffered updates

2015-11-24 Thread Renaud Delbru (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-8263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Renaud Delbru updated SOLR-8263:

Attachment: SOLR-8263-trunk-3.patch

[~shalinmangar] [~erickerickson] A new patch including the dedup logic for the 
buffered updates. I have launched a few runs without any issues. The change is 
minimal, but it might be good to beast it one last time?

> Tlog replication could interfere with the replay of buffered updates
> 
>
> Key: SOLR-8263
> URL: https://issues.apache.org/jira/browse/SOLR-8263
> Project: Solr
>  Issue Type: Sub-task
>    Reporter: Renaud Delbru
>Assignee: Erick Erickson
> Attachments: SOLR-8263-trunk-1.patch, SOLR-8263-trunk-2.patch, 
> SOLR-8263-trunk-3.patch
>
>
> The current implementation of the tlog replication might interfere with the 
> replay of the buffered updates. The current tlog replication works as follows:
> 1) Fetch the tlog files from the master.
> 2) Reset the update log before switching the tlog directory.
> 3) Switch the tlog directory and re-initialise the update log with the new 
> directory.
> Currently there is no logic to keep "buffered updates" while resetting and 
> reinitializing the update log.






[jira] [Comment Edited] (SOLR-8263) Tlog replication could interfere with the replay of buffered updates

2015-11-24 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15024736#comment-15024736
 ] 

Renaud Delbru edited comment on SOLR-8263 at 11/24/15 4:07 PM:
---

[~shalinmangar] [~erickerickson] A new patch including the dedup logic for the 
buffered updates. I have launched a few runs without any issues. The changes 
are minimal, but it might be good to beast it one last time?


was (Author: rendel):
[~shalinmangar] [~erickerickson] A new patch including the dedup logic for the 
buffered updates. I have launched a few run without any issue. The change is 
minimal, but it might be good to beast it a last time ?

> Tlog replication could interfere with the replay of buffered updates
> 
>
> Key: SOLR-8263
> URL: https://issues.apache.org/jira/browse/SOLR-8263
> Project: Solr
>  Issue Type: Sub-task
>    Reporter: Renaud Delbru
>Assignee: Erick Erickson
> Attachments: SOLR-8263-trunk-1.patch, SOLR-8263-trunk-2.patch, 
> SOLR-8263-trunk-3.patch
>
>
> The current implementation of the tlog replication might interfere with the 
> replay of the buffered updates. The current tlog replication works as follows:
> 1) Fetch the tlog files from the master.
> 2) Reset the update log before switching the tlog directory.
> 3) Switch the tlog directory and re-initialise the update log with the new 
> directory.
> Currently there is no logic to keep "buffered updates" while resetting and 
> reinitializing the update log.






[jira] [Commented] (SOLR-8263) Tlog replication could interfere with the replay of buffered updates

2015-11-24 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15024675#comment-15024675
 ] 

Renaud Delbru commented on SOLR-8263:
-

[~shalinmangar] Yes, you understood the sequence correctly. To be more 
precise, here is how it works:
1) The tlog files of the leader are downloaded into a temporary directory.
2) After the files have been downloaded properly, a write lock is acquired by 
the IndexFetcher. The original tlog directory is renamed as a backup directory, 
and the temporary directory is renamed as the active tlog directory.
3) The update log is reset with the new active tlog directory. During this 
reset, the recovery info is used to read the backup buffered tlog file, and 
every buffered operation is copied to the new buffered tlog.
4) The write lock is released, and the recovery operation continues and 
applies the buffered updates.

Indeed, the buffered tlog can contain operations duplicated in the replica's 
update log. During the recovery operation, the replica might receive from the 
leader some operations that will be buffered, but these might also be present 
in one of the tlogs downloaded from the leader. Apart from the disk space 
used by these duplicate operations and the additional network transfer, there 
is no harm, as the duplicates will be ignored by the peer cluster. 
We could improve the tlog recovery operation to de-duplicate the buffered tlog 
while copying the buffered updates: check the version of the latest 
operations in the downloaded tlog, and skip operations from the buffered tlog 
if their version is lower than the latest known one. It should be a relatively 
small patch. I can try to work on that in the next few days and submit 
something, if that's fine with you and [~erickerickson]?
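The proposed version-based skip could look roughly like this. Operations are reduced to bare version numbers for illustration; the real implementation would work on tlog entries, and the class and method names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the proposed de-duplication: while copying buffered
// updates into the new buffered tlog, skip operations whose version is not
// newer than the latest version seen in the tlogs downloaded from the leader.
class BufferedTlogDedup {

    static List<Long> copySkippingDuplicates(List<Long> bufferedVersions,
                                             long latestDownloadedVersion) {
        List<Long> copied = new ArrayList<>();
        for (long v : bufferedVersions) {
            if (v > latestDownloadedVersion) {
                copied.add(v);  // only operations newer than the downloaded tlogs
            }
        }
        return copied;
    }
}
```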



> Tlog replication could interfere with the replay of buffered updates
> 
>
> Key: SOLR-8263
> URL: https://issues.apache.org/jira/browse/SOLR-8263
> Project: Solr
>  Issue Type: Sub-task
>    Reporter: Renaud Delbru
>Assignee: Erick Erickson
> Attachments: SOLR-8263-trunk-1.patch, SOLR-8263-trunk-2.patch
>
>
> The current implementation of the tlog replication might interfere with the 
> replay of the buffered updates. The current tlog replication works as follows:
> 1) Fetch the tlog files from the master.
> 2) Reset the update log before switching the tlog directory.
> 3) Switch the tlog directory and re-initialise the update log with the new 
> directory.
> Currently there is no logic to keep "buffered updates" while resetting and 
> reinitializing the update log.






[jira] [Commented] (SOLR-8292) TransactionLog.next() does not honor contract and return null for EOF

2015-11-18 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15010699#comment-15010699
 ] 

Renaud Delbru commented on SOLR-8292:
-

I have checked on the cdcr code side, and whenever a log reader is used, it is 
by a single thread only. So the problem might lie somewhere else.

> TransactionLog.next() does not honor contract and return null for EOF
> -
>
> Key: SOLR-8292
> URL: https://issues.apache.org/jira/browse/SOLR-8292
> Project: Solr
>  Issue Type: Bug
>Reporter: Erick Erickson
>
> This came to light in CDCR testing, which stresses this code a lot, there's a 
> stack trace showing this line (641 trunk) throwing an EOF exception:
> o = codec.readVal(fis);
> At first I thought to just wrap reading fis in a try/catch and return null, 
> but looking at the code a bit more I'm not so sure, that seems like it'd mask 
> what looks at first glance like a bug in the logic.
> A few lines earlier (633-4) there's these lines:
> // shouldn't currently happen - header and first record are currently written 
> at the same time
> if (fis.position() >= fos.size()) {
> Why are we comparing the input file position against the size of the 
> output file? Maybe because the 'i' key is right next to the 'o' key? The 
> comment hints that it's checking for the ability to read the first record in 
> input stream along with the header. And perhaps there's a different issue 
> here because the expectation clearly is that the first record should be there 
> if the header is.
> So what's the right thing to do? Wrap in a try/catch and return null for EOF? 
> Change the test? Do both?
> I can take care of either, but wanted a clue whether the comparison of fis to 
> fos is intended.






[jira] [Commented] (SOLR-8292) TransactionLog.next() does not honor contract and return null for EOF

2015-11-16 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15006519#comment-15006519
 ] 

Renaud Delbru commented on SOLR-8292:
-

I have reviewed all the methods writing records to the tlog file, and all of 
them are properly synchronised with the flushing of the output stream. 
However, access to the input stream is not synchronised. Could it be that one 
concurrent thread changed the fis position while another was trying to read a 
record? CdcrLogReader#resetToLastPosition could interfere with 
TransactionLog.LogReader#next.


> TransactionLog.next() does not honor contract and return null for EOF
> -
>
> Key: SOLR-8292
> URL: https://issues.apache.org/jira/browse/SOLR-8292
> Project: Solr
>  Issue Type: Bug
>Reporter: Erick Erickson
>
> This came to light in CDCR testing, which stresses this code a lot, there's a 
> stack trace showing this line (641 trunk) throwing an EOF exception:
> o = codec.readVal(fis);
> At first I thought to just wrap reading fis in a try/catch and return null, 
> but looking at the code a bit more I'm not so sure, that seems like it'd mask 
> what looks at first glance like a bug in the logic.
> A few lines earlier (633-4) there's these lines:
> // shouldn't currently happen - header and first record are currently written 
> at the same time
> if (fis.position() >= fos.size()) {
> Why are we comparing the input file position against the size of the 
> output file? Maybe because the 'i' key is right next to the 'o' key? The 
> comment hints that it's checking for the ability to read the first record in 
> input stream along with the header. And perhaps there's a different issue 
> here because the expectation clearly is that the first record should be there 
> if the header is.
> So what's the right thing to do? Wrap in a try/catch and return null for EOF? 
> Change the test? Do both?
> I can take care of either, but wanted a clue whether the comparison of fis to 
> fos is intended.






[jira] [Commented] (SOLR-8292) TransactionLog.next() does not honor contract and return null for EOF

2015-11-16 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15006537#comment-15006537
 ] 

Renaud Delbru commented on SOLR-8292:
-

The contract of the log reader is that it should not be used by more than 
one thread (see the comment on TransactionLog#getReader). I'll double check 
that the cdcr code respects this.

> TransactionLog.next() does not honor contract and return null for EOF
> -
>
> Key: SOLR-8292
> URL: https://issues.apache.org/jira/browse/SOLR-8292
> Project: Solr
>  Issue Type: Bug
>Reporter: Erick Erickson
>
> This came to light in CDCR testing, which stresses this code a lot, there's a 
> stack trace showing this line (641 trunk) throwing an EOF exception:
> o = codec.readVal(fis);
> At first I thought to just wrap reading fis in a try/catch and return null, 
> but looking at the code a bit more I'm not so sure, that seems like it'd mask 
> what looks at first glance like a bug in the logic.
> A few lines earlier (633-4) there's these lines:
> // shouldn't currently happen - header and first record are currently written 
> at the same time
> if (fis.position() >= fos.size()) {
> Why are we comparing the input file position against the size of the 
> output file? Maybe because the 'i' key is right next to the 'o' key? The 
> comment hints that it's checking for the ability to read the first record in 
> input stream along with the header. And perhaps there's a different issue 
> here because the expectation clearly is that the first record should be there 
> if the header is.
> So what's the right thing to do? Wrap in a try/catch and return null for EOF? 
> Change the test? Do both?
> I can take care of either, but wanted a clue whether the comparison of fis to 
> fos is intended.






EOF contract in TransactionLog

2015-11-13 Thread Renaud Delbru

Dear all,

in one of the unit tests of CDCR, we stumbled upon the following issue:

 [junit4]   2> java.io.EOFException
 [junit4]   2>at 
org.apache.solr.common.util.FastInputStream.readByte(FastInputStream.java:208)
 [junit4]   2>at 
org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:198)
 [junit4]   2>at 
org.apache.solr.update.TransactionLog$LogReader.next(TransactionLog.java:641)
 [junit4]   2>at 
org.apache.solr.update.CdcrTransactionLog$CdcrLogReader.next(CdcrTransactionLog.java:154)


According to the comment on the LogReader#next() method, the contract should 
be to return null when EOF is reached. However, this does not seem to be 
respected, as per the stack trace. Is it a bug, and should I open an issue to 
fix it? Or is it just the method comment that is out of date (and should 
probably be fixed as well)?
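For reference, the contract being discussed, returning null at EOF instead of propagating EOFException, can be sketched as follows. This is illustrative only and does not mirror TransactionLog's actual record format or internals.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;

// Illustrative sketch: next() returns null at EOF instead of letting
// EOFException escape. Does not mirror TransactionLog's record format.
class EofSafeReader {

    static Long next(DataInputStream in) throws IOException {
        try {
            return in.readLong();  // read one (fake) record
        } catch (EOFException e) {
            return null;           // honor the null-at-EOF contract
        }
    }
}
```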


Thanks
--
Renaud Delbru


[jira] [Updated] (SOLR-8263) Tlog replication could interfere with the replay of buffered updates

2015-11-13 Thread Renaud Delbru (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-8263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Renaud Delbru updated SOLR-8263:

Attachment: SOLR-8263-trunk-2.patch

A new version of the patch (this replaces the previous one), which includes a 
fix related to the write lock.
In the previous patch, the write lock was removed accidentally while 
re-initialising the update log with the new set of tlog files (the init method 
was creating a new instance of the VersionInfo). As a consequence, there was a 
small time frame during which updates were lost (a batch of documents was 
missed in 1 out of 10 runs). The fix introduces a new init method that 
preserves the original VersionInfo instance and therefore preserves the write 
lock.
I have run the test 50 times without seeing the issue anymore.

> Tlog replication could interfere with the replay of buffered updates
> 
>
> Key: SOLR-8263
> URL: https://issues.apache.org/jira/browse/SOLR-8263
> Project: Solr
>  Issue Type: Sub-task
>    Reporter: Renaud Delbru
>Assignee: Erick Erickson
> Attachments: SOLR-8263-trunk-1.patch, SOLR-8263-trunk-2.patch
>
>
> The current implementation of the tlog replication might interfere with the 
> replay of the buffered updates. The current tlog replication works as follows:
> 1) Fetch the tlog files from the master.
> 2) Reset the update log before switching the tlog directory.
> 3) Switch the tlog directory and re-initialise the update log with the new 
> directory.
> Currently there is no logic to keep "buffered updates" while resetting and 
> reinitializing the update log.






[jira] [Comment Edited] (SOLR-8263) Tlog replication could interfere with the replay of buffered updates

2015-11-13 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15004348#comment-15004348
 ] 

Renaud Delbru edited comment on SOLR-8263 at 11/13/15 5:38 PM:
---

A new version of the patch (this replaces the previous one), which includes a 
fix related to the write lock.
In the previous patch, the write lock was removed accidentally while 
re-initialising the update log with the new set of tlog files (the init method 
was creating a new instance of the VersionInfo). As a consequence, there was a 
small time frame during which updates were lost (a batch of documents was 
missed in 1 out of 10 runs). The fix introduces a new init method that 
preserves the original VersionInfo instance and therefore preserves the write 
lock.
I have run the test 50 times without seeing the issue anymore.


was (Author: rendel):
A new version of the patch (this replaces the previous one) which includes a 
fix related to the write lock.
In the previous patch, the write lock was removed accidentally while 
re-initialising the update log with the new set of tlog files (the init method 
was creating a new instance of the VersionInfo). As a consequence there was a 
small time frame where updates were lost (a batch of documents were missed in 1 
over 10 runs). The fix introduces a new init method that preserves the original 
VersionInfo instance and therefore preserves the write lock.
I have run the test 50 times without seeing anymore the issue.

> Tlog replication could interfere with the replay of buffered updates
> 
>
> Key: SOLR-8263
> URL: https://issues.apache.org/jira/browse/SOLR-8263
> Project: Solr
>  Issue Type: Sub-task
>    Reporter: Renaud Delbru
>Assignee: Erick Erickson
> Attachments: SOLR-8263-trunk-1.patch, SOLR-8263-trunk-2.patch
>
>
> The current implementation of the tlog replication might interfere with the 
> replay of the buffered updates. The current tlog replication works as follows:
> 1) Fetch the tlog files from the master.
> 2) Reset the update log before switching the tlog directory.
> 3) Switch the tlog directory and re-initialise the update log with the new 
> directory.
> Currently there is no logic to keep "buffered updates" while resetting and 
> reinitializing the update log.






[jira] [Updated] (SOLR-8263) Tlog replication could interfere with the replay of buffered updates

2015-11-12 Thread Renaud Delbru (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-8263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Renaud Delbru updated SOLR-8263:

Attachment: SOLR-8263-trunk-1.patch

[~shalinmangar] [~erickerickson] An initial patch for this issue. It includes 
a unit test that reproduces the described issue, and an initial fix.
The index fetcher now takes care of moving the buffered updates from the 
previous update log to the new one. During the move, the index fetcher blocks 
updates to ensure that no buffered updates are missed.

> Tlog replication could interfere with the replay of buffered updates
> 
>
> Key: SOLR-8263
> URL: https://issues.apache.org/jira/browse/SOLR-8263
> Project: Solr
>  Issue Type: Sub-task
>    Reporter: Renaud Delbru
>Assignee: Erick Erickson
> Attachments: SOLR-8263-trunk-1.patch
>
>
> The current implementation of the tlog replication might interfere with the 
> replay of the buffered updates. The current tlog replication works as follows:
> 1) Fetch the tlog files from the master.
> 2) Reset the update log before switching the tlog directory.
> 3) Switch the tlog directory and re-initialise the update log with the new 
> directory.
> Currently there is no logic to keep "buffered updates" while resetting and 
> reinitializing the update log.






[jira] [Comment Edited] (SOLR-8263) Tlog replication could interfere with the replay of buffered updates

2015-11-12 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15002018#comment-15002018
 ] 

Renaud Delbru edited comment on SOLR-8263 at 11/12/15 12:50 PM:


[~shalinmangar] [~erickerickson] An initial patch for this issue. It includes 
a unit test that reproduces the described issue, and an initial fix.
The index fetcher is now taking care of moving the buffered updates from the 
previous update log to the new one. During the move, the index fetcher is 
blocking updates to ensure that no buffered updates are missed.


was (Author: rendel):
[~shalinmangar][~erickerickson] An initial first patch for this issue. It 
includes a unit test that was able to produce the describe issue, and an 
initial fix for the issue.
The index fetcher is now taking care of moving the buffered updates of the 
previous update log to the new one. During the move, the index fetcher is 
blocking updates to ensure that no buffered updates will be missed.

> Tlog replication could interfere with the replay of buffered updates
> 
>
> Key: SOLR-8263
> URL: https://issues.apache.org/jira/browse/SOLR-8263
> Project: Solr
>  Issue Type: Sub-task
>    Reporter: Renaud Delbru
>Assignee: Erick Erickson
> Attachments: SOLR-8263-trunk-1.patch
>
>
> The current implementation of the tlog replication might interfere with the 
> replay of the buffered updates. The current tlog replication works as follows:
> 1) Fetch the tlog files from the master.
> 2) Reset the update log before switching the tlog directory.
> 3) Switch the tlog directory and re-initialise the update log with the new 
> directory.
> Currently there is no logic to keep "buffered updates" while resetting and 
> reinitializing the update log.






[jira] [Created] (SOLR-8263) Tlog replication could interfere with the replay of buffered updates

2015-11-09 Thread Renaud Delbru (JIRA)
Renaud Delbru created SOLR-8263:
---

 Summary: Tlog replication could interfere with the replay of 
buffered updates
 Key: SOLR-8263
 URL: https://issues.apache.org/jira/browse/SOLR-8263
 Project: Solr
  Issue Type: Sub-task
Reporter: Renaud Delbru


The current implementation of the tlog replication might interfere with the 
replay of the buffered updates. The current tlog replication works as follows:
1) Fetch the tlog files from the master.
2) Reset the update log before switching the tlog directory.
3) Switch the tlog directory and re-initialise the update log with the new 
directory.
Currently there is no logic to keep "buffered updates" while resetting and 
reinitializing the update log.






[jira] [Updated] (SOLR-6273) Cross Data Center Replication

2015-11-02 Thread Renaud Delbru (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-6273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Renaud Delbru updated SOLR-6273:

Attachment: SOLR-6273-trunk-testfix6.patch

[~erickerickson] Please find attached your patch with some fixes.
The cause of the NPE was that some replication handler tests were not running 
in cloud mode, and therefore the update log was null. I have added a fix for 
that. I have also fixed some merge issues with the latest trunk. The full Solr 
test suite was executed successfully.

[~shalinmangar] Regarding the potential issue with the transaction log 
replication, I will have a look this week. Should I open a sub-issue to track 
this separately?

> Cross Data Center Replication
> -
>
> Key: SOLR-6273
> URL: https://issues.apache.org/jira/browse/SOLR-6273
> Project: Solr
>  Issue Type: New Feature
>Reporter: Yonik Seeley
>Assignee: Erick Erickson
> Attachments: SOLR-6273-trunk-testfix1.patch, 
> SOLR-6273-trunk-testfix2.patch, SOLR-6273-trunk-testfix3.patch, 
> SOLR-6273-trunk-testfix6.patch, SOLR-6273-trunk.patch, SOLR-6273-trunk.patch, 
> SOLR-6273.patch, SOLR-6273.patch, SOLR-6273.patch, SOLR-6273.patch, 
> forShalin.patch
>
>
> This is the master issue for Cross Data Center Replication (CDCR)
> described at a high level here: 
> http://heliosearch.org/solr-cross-data-center-replication/






[jira] [Comment Edited] (SOLR-6273) Cross Data Center Replication

2015-11-02 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-6273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985508#comment-14985508
 ] 

Renaud Delbru edited comment on SOLR-6273 at 11/2/15 4:44 PM:
--

[~erickerickson] Please find attached your patch with some fixes.
The cause of the NPE was that some replication handler tests were not running 
in cloud mode, and therefore the update log was null. I have added a simple fix 
for that issue. I have also fixed some merge issues with the latest trunk. The 
full Solr test suite was executed successfully.

[~shalinmangar] Regarding the potential issue with the transaction log 
replication, I will have a look this week. Should I open a sub-issue to track 
this separately?


was (Author: rendel):
[~erickerickson] Find attached your patch with some fixes.
The cause of the NPE was that some replication handler tests were not running 
in cloud mode, and therefore the update log was null. I have added a fix in 
that. I have also fixed some merge issues with the latest trunk. The full Solr 
test suite was executed successfully.

[~shalinmangar] Regarding the potential issue with the transaction log 
replication, I will have a look this week. Should I open a sub-issue to track 
this separately ? 

> Cross Data Center Replication
> -
>
> Key: SOLR-6273
> URL: https://issues.apache.org/jira/browse/SOLR-6273
> Project: Solr
>  Issue Type: New Feature
>Reporter: Yonik Seeley
>Assignee: Erick Erickson
> Attachments: SOLR-6273-trunk-testfix1.patch, 
> SOLR-6273-trunk-testfix2.patch, SOLR-6273-trunk-testfix3.patch, 
> SOLR-6273-trunk-testfix6.patch, SOLR-6273-trunk.patch, SOLR-6273-trunk.patch, 
> SOLR-6273.patch, SOLR-6273.patch, SOLR-6273.patch, SOLR-6273.patch, 
> forShalin.patch
>
>
> This is the master issue for Cross Data Center Replication (CDCR)
> described at a high level here: 
> http://heliosearch.org/solr-cross-data-center-replication/






[jira] [Commented] (SOLR-6273) Cross Data Center Replication

2015-10-22 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-6273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14968987#comment-14968987
 ] 

Renaud Delbru commented on SOLR-6273:
-

The tlog replication is only relevant to the source cluster, as it ensures that 
tlogs are replicated between a master and slaves in case of a recovery (with a 
snappull). Without it, there are some scenarios where a slave can end up with 
an incomplete update log; if that slave becomes the master, we will miss some 
updates and the target cluster will become inconsistent with respect to the 
source cluster.


> Cross Data Center Replication
> -
>
> Key: SOLR-6273
> URL: https://issues.apache.org/jira/browse/SOLR-6273
> Project: Solr
>  Issue Type: New Feature
>Reporter: Yonik Seeley
>Assignee: Erick Erickson
> Attachments: SOLR-6273-trunk-testfix1.patch, 
> SOLR-6273-trunk-testfix2.patch, SOLR-6273-trunk-testfix3.patch, 
> SOLR-6273-trunk.patch, SOLR-6273-trunk.patch, SOLR-6273.patch, 
> SOLR-6273.patch, SOLR-6273.patch, SOLR-6273.patch
>
>
> This is the master issue for Cross Data Center Replication (CDCR)
> described at a high level here: 
> http://heliosearch.org/solr-cross-data-center-replication/






[jira] [Commented] (SOLR-6273) Cross Data Center Replication

2015-10-22 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-6273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14969006#comment-14969006
 ] 

Renaud Delbru commented on SOLR-6273:
-

Yes, I think we should probably change the default value of the scheduler to 
1ms unless we change the model to a streaming one. 1000ms is far too high as a 
default value.

> Cross Data Center Replication
> -
>
> Key: SOLR-6273
> URL: https://issues.apache.org/jira/browse/SOLR-6273
> Project: Solr
>  Issue Type: New Feature
>Reporter: Yonik Seeley
>Assignee: Erick Erickson
> Attachments: SOLR-6273-trunk-testfix1.patch, 
> SOLR-6273-trunk-testfix2.patch, SOLR-6273-trunk-testfix3.patch, 
> SOLR-6273-trunk.patch, SOLR-6273-trunk.patch, SOLR-6273.patch, 
> SOLR-6273.patch, SOLR-6273.patch, SOLR-6273.patch
>
>
> This is the master issue for Cross Data Center Replication (CDCR)
> described at a high level here: 
> http://heliosearch.org/solr-cross-data-center-replication/






[jira] [Commented] (SOLR-6273) Cross Data Center Replication

2015-10-22 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-6273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14968970#comment-14968970
 ] 

Renaud Delbru commented on SOLR-6273:
-

[~shalinmangar] thanks for looking into this.

Regarding performance (2 and 3), it is true that the right batch size and 
scheduler delay are very important for optimal performance. With the proper 
batch sizes and scheduler delays, we have seen very low update latency between 
the source and target clusters. In your setup, one document was approximately 
0.2kb in size, therefore a batch was ~14kb, which should correspond to a 
transfer rate of ~14mb/s. With such a transfer rate, the replication should 
have been done in a few seconds or minutes, not hours. Could you give more 
information about your setup / benchmark? Was replication turned off while 
you were indexing on the source, or did you turn it on afterwards?
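A quick back-of-the-envelope check of the figures above. The document size comes from the discussion; the batch size of ~70 documents and the 1ms scheduler delay are assumptions chosen to reproduce the quoted ~14kb and ~14mb/s numbers:

```java
// Sanity-check the transfer-rate arithmetic: docs per batch and the
// scheduler delay are illustrative assumptions, not measured values.
public class ThroughputEstimate {
    public static void main(String[] args) {
        double docSizeKb = 0.2;   // approximate document size (from the thread)
        int docsPerBatch = 70;    // assumed batch size in documents
        double scheduleMs = 1.0;  // assumed scheduler delay between batches

        double batchKb = docSizeKb * docsPerBatch;        // ~14 kB per batch
        double batchesPerSecond = 1000.0 / scheduleMs;    // 1000 batches/s
        double rateMbPerSec = batchKb * batchesPerSecond / 1000.0;

        System.out.printf("batch=%.0fkB rate=%.0fMB/s%n", batchKb, rateMbPerSec);
    }
}
```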

In terms of moving from a batch model to a pure streaming one, this would 
probably simplify the configuration on the user side, but in terms of 
performance I am not sure - maybe some other people can give their opinion 
here. A batch model might not use that much memory (if properly configured), 
and the same goes for transfer speed (if the batch size is properly configured 
too). Another way to simplify the configuration for the user is, as you 
proposed, to have a configurable transfer rate with some logic to automatically 
adjust the batch size and scheduler delay based on that rate.

About 5, I think transfer rate is a good addition. Latency could be computed as 
well, since the QUEUES monitoring action returns the last document timestamp.


> Cross Data Center Replication
> -
>
> Key: SOLR-6273
> URL: https://issues.apache.org/jira/browse/SOLR-6273
> Project: Solr
>  Issue Type: New Feature
>Reporter: Yonik Seeley
>Assignee: Erick Erickson
> Attachments: SOLR-6273-trunk-testfix1.patch, 
> SOLR-6273-trunk-testfix2.patch, SOLR-6273-trunk-testfix3.patch, 
> SOLR-6273-trunk.patch, SOLR-6273-trunk.patch, SOLR-6273.patch, 
> SOLR-6273.patch, SOLR-6273.patch, SOLR-6273.patch
>
>
> This is the master issue for Cross Data Center Replication (CDCR)
> described at a high level here: 
> http://heliosearch.org/solr-cross-data-center-replication/






[jira] [Commented] (SOLR-6273) Cross Data Center Replication

2015-10-22 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-6273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14969188#comment-14969188
 ] 

Renaud Delbru commented on SOLR-6273:
-

This is the first time I have seen this issue.
How did you perform the reload? Did you delete the source collection before 
the reload, or just reload and overwrite the existing documents?

> Cross Data Center Replication
> -
>
> Key: SOLR-6273
> URL: https://issues.apache.org/jira/browse/SOLR-6273
> Project: Solr
>  Issue Type: New Feature
>Reporter: Yonik Seeley
>Assignee: Erick Erickson
> Attachments: SOLR-6273-trunk-testfix1.patch, 
> SOLR-6273-trunk-testfix2.patch, SOLR-6273-trunk-testfix3.patch, 
> SOLR-6273-trunk.patch, SOLR-6273-trunk.patch, SOLR-6273.patch, 
> SOLR-6273.patch, SOLR-6273.patch, SOLR-6273.patch
>
>
> This is the master issue for Cross Data Center Replication (CDCR)
> described at a high level here: 
> http://heliosearch.org/solr-cross-data-center-replication/






[jira] [Commented] (SOLR-6273) Cross Data Center Replication

2015-10-22 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-6273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14969182#comment-14969182
 ] 

Renaud Delbru commented on SOLR-6273:
-

That's a good point, and I think the current implementation might interfere 
with the replay of the buffered updates. The current tlog replication works as 
follows:
1) Fetch the tlog files from the master
2) Reset the update log before switching the tlog directory
3) Switch the tlog directory and re-initialise the update log with the new 
directory.
Currently there is no logic to keep "buffered updates" while resetting and 
reinitializing the update log. It looks like the tlog replication still needs 
some work.

> Cross Data Center Replication
> -
>
> Key: SOLR-6273
> URL: https://issues.apache.org/jira/browse/SOLR-6273
> Project: Solr
>  Issue Type: New Feature
>Reporter: Yonik Seeley
>Assignee: Erick Erickson
> Attachments: SOLR-6273-trunk-testfix1.patch, 
> SOLR-6273-trunk-testfix2.patch, SOLR-6273-trunk-testfix3.patch, 
> SOLR-6273-trunk.patch, SOLR-6273-trunk.patch, SOLR-6273.patch, 
> SOLR-6273.patch, SOLR-6273.patch, SOLR-6273.patch
>
>
> This is the master issue for Cross Data Center Replication (CDCR)
> described at a high level here: 
> http://heliosearch.org/solr-cross-data-center-replication/






Re: Strange comment in CdcrReplicationHandlerTest.java

2015-07-30 Thread Renaud Delbru
Yes, my apologies for this, I didn't catch this one when I reviewed the 
code before commit.

--
Renaud Delbru

On 29/07/15 23:29, Erick Erickson wrote:

Standard Apache license; this was just a couple of erroneous lines at the top, 
I suspect auto-added by his IDE. I missed it too.

Will fix this in the next week, I'm traveling right now.

To wit:

/*
  * Licensed to the Apache Software Foundation (ASF) under one or more
  * contributor license agreements.  See the NOTICE file distributed with
  * this work for additional information regarding copyright ownership.
  * The ASF licenses this file to You under the Apache License, Version 2.0
  * (the "License"); you may not use this file except in compliance with
  * the License.  You may obtain a copy of the License at
  *
  * http://www.apache.org/licenses/LICENSE-2.0
  *
  * Unless required by applicable law or agreed to in writing, software
  * distributed under the License is distributed on an "AS IS" BASIS,
  * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  * See the License for the specific language governing permissions and
  * limitations under the License.
  */

On Wed, Jul 29, 2015 at 4:27 PM, Uwe Schindler u...@thetaphi.de wrote:

RAT would only fail if the license header is missing completely. I don't think it checks 
for copyright notices.

If there is no license header, we should check our RAT config! What does it 
list as license for that file?

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de



-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Wednesday, July 29, 2015 7:36 PM
To: dev@lucene.apache.org
Subject: Re: Strange comment in CdcrReplicationHandlerTest.java

Yeah, I wondered that myself.

On Wed, Jul 29, 2015 at 1:35 PM, Ramkumar R. Aiyengar
andyetitmo...@gmail.com wrote:

Hmm.. I would have expected rat to fail this in precommit actually..

On 29 Jul 2015 18:01, Timothy Potter thelabd...@gmail.com wrote:


Why is this in the code?

/**
  * Copyright (c) 2015 Renaud Delbru. All Rights Reserved.
  */
package org.apache.solr.cloud;




[jira] [Commented] (SOLR-6461) peer cluster configuration

2015-07-03 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-6461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14613126#comment-14613126
 ] 

Renaud Delbru commented on SOLR-6461:
-

Yes, it has been fixed by the superset issue SOLR-6273. The configuration of 
the target clusters is done through the Replica Parameters (see 
[manual|https://docs.google.com/document/d/1DZHUFM3z9OX171DeGjcLTRI9uULM-NB1KsCSpVL3Zy0/edit#]).
 It consists of 3 parameters: zkHost to indicate the address of the ZooKeeper 
of the target cluster, source to indicate the source collection to replicate, 
and target to indicate the target collection that will receive the updates. 

 peer cluster configuration
 --

 Key: SOLR-6461
 URL: https://issues.apache.org/jira/browse/SOLR-6461
 Project: Solr
  Issue Type: Sub-task
Reporter: Yonik Seeley

 From http://heliosearch.org/solr-cross-data-center-replication/#Overview
 Clusters will be configured to know about each other, most likely through 
 keeping a cluster peer list in zookeeper. One essential piece of information 
 will be the zookeeper quorum address for each cluster peer. Any node in one 
 cluster can know the configuration of another cluster via a zookeeper 
 client.






[jira] [Commented] (SOLR-6273) Cross Data Center Replication

2015-07-03 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-6273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14613124#comment-14613124
 ] 

Renaud Delbru commented on SOLR-6273:
-

Hi Martin,

The google doc is up to date with the current implementation. One suggestion 
concerns tuning the performance of the replication. The performance of the 
replication depends on the Replicator Parameters. In your scenario, the two 
main parameters will be schedule and batchSize. If you would like to see a 
very small latency between replication batches, you can decrease the schedule 
parameter from 1000ms to 1ms. To improve the network IO, you can also try to 
increase the batchSize parameter to a larger number (if your documents are a 
few kb or less, you can try increasing it to 500, 1000 or more). 

To measure the impact these parameters have on the replication performance, 
you can use the monitoring API, e.g., ?action=QUEUES, to retrieve some stats 
about the replication queue. The queue size will tell you how far your replica 
lags behind the source cluster. If the replication is not fast enough, you'll 
see the queue size increasing. The idea is to tune the schedule and batchSize 
parameters until you find the optimal values for your collection and setup, 
and the queue stays relatively small and stable.
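For illustration, the schedule and batchSize knobs discussed above sit in the CDCR request handler configuration in solrconfig.xml. The snippet below is a sketch based on the parameter names used in this thread; the host names and values are placeholders, not recommendations:

```xml
<requestHandler name="/cdcr" class="solr.CdcrRequestHandler">
  <!-- Replica parameters: where to push updates -->
  <lst name="replica">
    <str name="zkHost">target-zk:2181</str>   <!-- ZooKeeper of the target cluster -->
    <str name="source">collection1</str>      <!-- source collection to replicate -->
    <str name="target">collection1</str>      <!-- target collection receiving updates -->
  </lst>
  <!-- Replicator parameters: the tuning knobs discussed above -->
  <lst name="replicator">
    <str name="threadPoolSize">2</str>
    <str name="schedule">10</str>    <!-- delay between batches, in ms -->
    <str name="batchSize">500</str>  <!-- documents per batch -->
  </lst>
</requestHandler>
```

The queue lag can then be inspected with the monitoring action, e.g. a GET request to /solr/collection1/cdcr?action=QUEUES.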

 Cross Data Center Replication
 -

 Key: SOLR-6273
 URL: https://issues.apache.org/jira/browse/SOLR-6273
 Project: Solr
  Issue Type: New Feature
Reporter: Yonik Seeley
Assignee: Erick Erickson
 Attachments: SOLR-6273-trunk-testfix1.patch, 
 SOLR-6273-trunk-testfix2.patch, SOLR-6273-trunk.patch, SOLR-6273-trunk.patch, 
 SOLR-6273.patch, SOLR-6273.patch, SOLR-6273.patch, SOLR-6273.patch


 This is the master issue for Cross Data Center Replication (CDCR)
 described at a high level here: 
 http://heliosearch.org/solr-cross-data-center-replication/






[jira] [Updated] (SOLR-6273) Cross Data Center Replication

2015-06-03 Thread Renaud Delbru (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-6273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Renaud Delbru updated SOLR-6273:

Attachment: SOLR-6273-trunk-testfix2.patch

[~erickerickson], I have attached a new patch regarding the unit test failures 
from the Jenkins job. It is likely that the errors we saw are due to the 
Jenkins server being under heavy load and therefore less responsive, which 
might trigger race conditions in the assertions of the unit tests.
I have added various safeguard methods to the unit test framework, so that it 
will wait for the completion of particular tasks (cdcr state replication, 
update log cleaning, etc.) and fail after a given timeout (15s).
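Such a safeguard typically boils down to a poll-until-condition-or-timeout helper. The sketch below is a hypothetical illustration of the pattern, not the actual test-framework code:

```java
import java.util.function.BooleanSupplier;

// Poll a condition until it holds, failing after a timeout - the same
// shape as the wait-for-task safeguards described above.
public class WaitUtil {
    static void waitFor(BooleanSupplier condition, long timeoutMs)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (!condition.getAsBoolean()) {
            if (System.currentTimeMillis() > deadline) {
                throw new AssertionError("condition not met within " + timeoutMs + " ms");
            }
            Thread.sleep(100); // poll interval
        }
    }

    public static void main(String[] args) throws InterruptedException {
        long start = System.currentTimeMillis();
        // stand-in condition that becomes true after ~300 ms
        waitFor(() -> System.currentTimeMillis() - start > 300, 15_000);
        System.out.println("done");
    }
}
```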

 Cross Data Center Replication
 -

 Key: SOLR-6273
 URL: https://issues.apache.org/jira/browse/SOLR-6273
 Project: Solr
  Issue Type: New Feature
Reporter: Yonik Seeley
Assignee: Erick Erickson
 Attachments: SOLR-6273-trunk-testfix1.patch, 
 SOLR-6273-trunk-testfix2.patch, SOLR-6273-trunk.patch, SOLR-6273-trunk.patch, 
 SOLR-6273.patch, SOLR-6273.patch, SOLR-6273.patch, SOLR-6273.patch


 This is the master issue for Cross Data Center Replication (CDCR)
 described at a high level here: 
 http://heliosearch.org/solr-cross-data-center-replication/






[jira] [Updated] (SOLR-6273) Cross Data Center Replication

2015-05-26 Thread Renaud Delbru (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-6273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Renaud Delbru updated SOLR-6273:

Attachment: SOLR-6273-trunk-testfix1.patch

[~erickerickson], I was able to reproduce the issues from the failed Jenkins 
build. After replicating the tlog files, the update log of the slave is not 
properly re-initialised, and it still contains references to the previous 
tlog files. I have attached a fix for this.

 Cross Data Center Replication
 -

 Key: SOLR-6273
 URL: https://issues.apache.org/jira/browse/SOLR-6273
 Project: Solr
  Issue Type: New Feature
Reporter: Yonik Seeley
Assignee: Erick Erickson
 Attachments: SOLR-6273-trunk-testfix1.patch, SOLR-6273-trunk.patch, 
 SOLR-6273-trunk.patch, SOLR-6273.patch, SOLR-6273.patch, SOLR-6273.patch, 
 SOLR-6273.patch


 This is the master issue for Cross Data Center Replication (CDCR)
 described at a high level here: 
 http://heliosearch.org/solr-cross-data-center-replication/






[jira] [Commented] (SOLR-6273) Cross Data Center Replication

2015-05-21 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-6273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14554099#comment-14554099
 ] 

Renaud Delbru commented on SOLR-6273:
-

[~erickerickson] I have checked the new patch on the latest trunk. The unit 
tests seem to properly run with the latest changes. Thanks for porting this to 
trunk.

 Cross Data Center Replication
 -

 Key: SOLR-6273
 URL: https://issues.apache.org/jira/browse/SOLR-6273
 Project: Solr
  Issue Type: New Feature
Reporter: Yonik Seeley
Assignee: Erick Erickson
 Attachments: SOLR-6273-trunk.patch, SOLR-6273-trunk.patch, 
 SOLR-6273.patch, SOLR-6273.patch, SOLR-6273.patch, SOLR-6273.patch


 This is the master issue for Cross Data Center Replication (CDCR)
 described at a high level here: 
 http://heliosearch.org/solr-cross-data-center-replication/






[jira] [Updated] (SOLR-6273) Cross Data Center Replication

2015-04-27 Thread Renaud Delbru (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-6273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Renaud Delbru updated SOLR-6273:

Attachment: SOLR-6273.patch

Here is a new patch with the following changes:

- Renamed 'slice' to 'shard'

- Removed an optimisation in the replication of tlog files which could lead to 
duplicate tlog entries on a slave node. We were trying to avoid transferring 
tlog files that were already present on the slave nodes in order to reduce 
network transfer. However, tlog files between the master and slave can differ, 
overlap, etc., making the comparison difficult to achieve. We removed this 
optimisation, and now during a recovery the tlog replication transfers all 
the tlog files from the master to the slave and replaces all the existing 
tlog files on the slave node.

 Cross Data Center Replication
 -

 Key: SOLR-6273
 URL: https://issues.apache.org/jira/browse/SOLR-6273
 Project: Solr
  Issue Type: New Feature
Reporter: Yonik Seeley
Assignee: Erick Erickson
 Attachments: SOLR-6273.patch, SOLR-6273.patch, SOLR-6273.patch, 
 SOLR-6273.patch


 This is the master issue for Cross Data Center Replication (CDCR)
 described at a high level here: 
 http://heliosearch.org/solr-cross-data-center-replication/






[jira] [Commented] (SOLR-6273) Cross Data Center Replication

2015-04-22 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-6273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14507324#comment-14507324
 ] 

Renaud Delbru commented on SOLR-6273:
-

Hi,

[~erickerickson]: From the original subtasks, the ones that are not covered 
by this patch are SOLR-6465 and SOLR-6466.

[~grishick]: The current patch does not cover the auto-provisioning of 
collections / live configuration of peer clusters. I think this issue should be 
tackled as part of SOLR-6466.

[~janhoy]: Could you point to where *slice* is being used instead of *shard*? 
It should not be a problem to change that.


 Cross Data Center Replication
 -

 Key: SOLR-6273
 URL: https://issues.apache.org/jira/browse/SOLR-6273
 Project: Solr
  Issue Type: New Feature
Reporter: Yonik Seeley
Assignee: Erick Erickson
 Attachments: SOLR-6273.patch, SOLR-6273.patch, SOLR-6273.patch


 This is the master issue for Cross Data Center Replication (CDCR)
 described at a high level here: 
 http://heliosearch.org/solr-cross-data-center-replication/






[jira] [Updated] (SOLR-6273) Cross Data Center Replication

2015-04-15 Thread Renaud Delbru (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-6273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Renaud Delbru updated SOLR-6273:

Attachment: SOLR-6273.patch

A new version of the patch. The patch has been created from the latest 
branch_5x. The full Solr test suite has been executed successfully (there were 
a few timeouts in some of the tests, but this seems unrelated to this patch). 
The principal change in this new version is a fix for the replication of 
tlog files. The {{ReplicationHandler}} and {{IndexFetcher}} have been modified 
to replicate tlog files during a recovery (only if CDCR is activated). Some 
unit tests covering various scenarios can be found
in {{core/src/test/org/apache/solr/cloud/CdcrReplicationHandlerTest.java}}.
In addition to the suite of automated unit tests, this version has been tested 
in various real deployments. One client has extensively tested the robustness 
and performance of CDCR in pre-prod and is satisfied with the results. 

We think that the code is in a relatively good state to be pushed to Solr. How 
can we move forward from here?

 Cross Data Center Replication
 -

 Key: SOLR-6273
 URL: https://issues.apache.org/jira/browse/SOLR-6273
 Project: Solr
  Issue Type: New Feature
Reporter: Yonik Seeley
 Attachments: SOLR-6273.patch, SOLR-6273.patch, SOLR-6273.patch


 This is the master issue for Cross Data Center Replication (CDCR)
 described at a high level here: 
 http://heliosearch.org/solr-cross-data-center-replication/






Re: Behavior of JettySolrRunner#start wrt Solr data/tlog directories

2015-03-23 Thread Renaud Delbru

Hi Alan,

Thanks for your feedback,
Indeed, after your reply I investigated a bit more and discovered that it was 
the UpdateHandler's init that was clearing the tlog directory when a 
non-persistent directory factory is used. The solution is to switch to a 
persistent directory factory for my tests.
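For reference, a persistent on-disk directory factory can be selected in solrconfig.xml. This is an illustrative snippet, not taken from the test in question:

```xml
<!-- Use an on-disk, persistent directory factory so that the data/ and
     tlog/ directories survive a JettySolrRunner restart, instead of a
     RAM-based factory whose contents are discarded on shutdown. -->
<directoryFactory name="DirectoryFactory"
                  class="solr.StandardDirectoryFactory"/>
```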

--
Renaud Delbru

On 03/23/2015 04:32 PM, Alan Woodward wrote:

Hi Renaud,

I don't think there's anything special in JettySolrRunner that cleans up
old directories, and the various Replication tests do something very
similar to what you want here - are you sure it's the JSR code that's
removing files here?

Alan Woodward
www.flax.co.uk http://www.flax.co.uk


On 23 Mar 2015, at 16:20, Renaud Delbru wrote:


Dear all,

I am currently working on the SOLR-6273 (CDCR) and I am currently
facing an issue with the Solr test framework. I am trying to write a
unit test where the slave node is stopped then restarted during the
execution of the unit test, in order to verify the replication of tlog
files (something that is introduced by CDCR). The scenario is the
following:
- instantiate a master and slave node
- send a first batch of updates to the master
- stop the slave
- send a second batch of updates to the master
- restart the slave in order to trigger replication
- verify that the update logs between the master and slaves are
properly replicated.

The problem I am facing is that whenever I restart the slave, using
the SolrJettyRunner.start() method, the Solr data directory and tlog
subdirectory are cleaned up, and not reused. Therefore I am unable to
test the scenario where the slave has some partial tlog files.

Is there a way to tell the jetty server to reuse the Solr data
directory / tlog directory instead of erasing it ? Or is there another
way to emulate that a slave node is down ?

Thanks
--
Renaud Delbru




Behavior of JettySolrRunner#start wrt Solr data/tlog directories

2015-03-23 Thread Renaud Delbru

Dear all,

I am currently working on the SOLR-6273 (CDCR) and I am currently facing 
an issue with the Solr test framework. I am trying to write a unit test 
where the slave node is stopped then restarted during the execution of 
the unit test, in order to verify the replication of tlog files 
(something that is introduced by CDCR). The scenario is the following:

- instantiate a master and slave node
- send a first batch of updates to the master
- stop the slave
- send a second batch of updates to the master
- restart the slave in order to trigger replication
- verify that the update logs between the master and slaves are properly 
replicated.


The problem I am facing is that whenever I restart the slave, using the 
SolrJettyRunner.start() method, the Solr data directory and tlog 
subdirectory are cleaned up, and not reused. Therefore I am unable to 
test the scenario where the slave has some partial tlog files.


Is there a way to tell the jetty server to reuse the Solr data directory 
/ tlog directory instead of erasing it ? Or is there another way to 
emulate that a slave node is down ?


Thanks
--
Renaud Delbru




[jira] [Updated] (SOLR-6460) Keep transaction logs around longer

2014-12-11 Thread Renaud Delbru (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-6460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Renaud Delbru updated SOLR-6460:

Attachment: SOLR-6460.patch

The latest version of the update log extension for cdcr. In addition to the 
previously described features, we extended the transaction log to compute and 
store the number of records in a tlog file. 
The patch SOLR-6819 is required for executing the unit tests.

 Keep transaction logs around longer
 ---

 Key: SOLR-6460
 URL: https://issues.apache.org/jira/browse/SOLR-6460
 Project: Solr
  Issue Type: Sub-task
Reporter: Yonik Seeley
 Attachments: SOLR-6460.patch, SOLR-6460.patch, SOLR-6460.patch, 
 SOLR-6460.patch


 Transaction logs are currently deleted relatively quickly... but we need to 
 keep them around much longer to be used as a source for cross-datacenter 
 recovery.  This will also be useful in the future for enabling peer-sync to 
 use more historical updates before falling back to replication.






[jira] [Commented] (SOLR-6460) Keep transaction logs around longer

2014-12-10 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-6460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14241396#comment-14241396
 ] 

Renaud Delbru commented on SOLR-6460:
-

CDCR depends on those modifications, but this extension is not dependent on 
CDCR.
All the modifications were implemented as an extension of the original update 
log. The reason to keep it separated was to avoid pushing unexpected problems 
into the other parts of SolrCloud. 
This extension can easily be integrated into the original update log / 
transaction log. Maybe this could be done once we are more confident 
with it.


 Keep transaction logs around longer
 ---

 Key: SOLR-6460
 URL: https://issues.apache.org/jira/browse/SOLR-6460
 Project: Solr
  Issue Type: Sub-task
Reporter: Yonik Seeley
 Attachments: SOLR-6460.patch, SOLR-6460.patch, SOLR-6460.patch


 Transaction logs are currently deleted relatively quickly... but we need to 
 keep them around much longer to be used as a source for cross-datacenter 
 recovery.  This will also be useful in the future for enabling peer-sync to 
 use more historical updates before falling back to replication.






[jira] [Updated] (SOLR-6819) Being able to configure the updates log implementation from solrconfig.xml

2014-12-10 Thread Renaud Delbru (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-6819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Renaud Delbru updated SOLR-6819:

Attachment: SOLR-6819.patch

A new patch that reverts the original behaviour of the update handler regarding 
the hdfs update log instantiation, in order to avoid backward-compatibility problems.

 Being able to configure the updates log implementation from solrconfig.xml
 --

 Key: SOLR-6819
 URL: https://issues.apache.org/jira/browse/SOLR-6819
 Project: Solr
  Issue Type: Improvement
  Components: SolrCloud, update
Affects Versions: Trunk
Reporter: Renaud Delbru
 Attachments: SOLR-6819.patch, SOLR-6819.patch


 CDCR requires its own implementation of the updates log. At the moment, there 
 is no way to configure the class to use when instantiating the updates log. 
 The UpdateHandler is deciding to instantiate the base class UpdateLog or its 
 hdfs version HdfsUpdateLog based on the directory path prefix.
 We can extend the UpdateHandler to allow for a class parameter to be defined 
 for the updateLog section of the solrconfig.xml. For example, the relevant 
 part of the solrconfig.xml will look like:
 {code:xml}
 <updateHandler class="solr.DirectUpdateHandler2">
   <updateLog class="solr.CdcrUpdateLog">
     <str name="dir">${solr.ulog.dir:}</str>
   </updateLog>
 </updateHandler>
 {code}
 where the updateLog entry has a class parameter indicating that the 
 CdcrUpdateLog implementation must be used.






[jira] [Created] (SOLR-6823) Improve extensibility of DistributedUpdateProcessor regarding version processing

2014-12-05 Thread Renaud Delbru (JIRA)
Renaud Delbru created SOLR-6823:
---

 Summary: Improve extensibility of DistributedUpdateProcessor 
regarding version processing
 Key: SOLR-6823
 URL: https://issues.apache.org/jira/browse/SOLR-6823
 Project: Solr
  Issue Type: Improvement
  Components: SolrCloud, update
Affects Versions: Trunk
Reporter: Renaud Delbru


As described in SOLR-6462, 
{quote}
doDeleteByQuery() is structured differently than processAdd() and 
processDelete() in DistributedUpdateProcessor. We refactored doDeleteByQuery() 
by extracting a portion of its code into a helper method versionDeleteByQuery() 
which is then overridden in the CdcrUpdateProcessor. This way doDeleteByQuery() 
is structurally similar to the other two cases and we are able to keep the CDCR 
logic completely separated.
{quote}

This issue provides a patch for the DistributedUpdateProcessor for trunk.
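
The refactoring described above can be sketched roughly as follows. This is a hypothetical simplification, not the actual Solr code: class and method names follow the quote, but the method bodies are illustrative placeholders.

```java
// Hypothetical sketch of the refactoring quoted above: the versioning part of
// doDeleteByQuery() is pulled into a protected helper, versionDeleteByQuery(),
// which CdcrUpdateProcessor overrides. Bodies are illustrative only.
class DistributedUpdateProcessor {
    void doDeleteByQuery(String query) {
        // ... distributed bookkeeping elided ...
        long version = versionDeleteByQuery(query);
        // ... apply the delete locally under `version` ...
    }

    // Extracted helper: assigns a version to the delete-by-query.
    protected long versionDeleteByQuery(String query) {
        return System.nanoTime(); // placeholder version-clock logic
    }
}

class CdcrUpdateProcessor extends DistributedUpdateProcessor {
    @Override
    protected long versionDeleteByQuery(String query) {
        // CDCR-specific handling (e.g. keeping a version already assigned by
        // a peer cluster) hooks in here, leaving the base class untouched.
        return super.versionDeleteByQuery(query);
    }
}
```

Because the override point is a single helper, doDeleteByQuery() now mirrors the structure of processAdd() and processDelete(), as the quote describes.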






[jira] [Updated] (SOLR-6823) Improve extensibility of DistributedUpdateProcessor regarding version processing

2014-12-05 Thread Renaud Delbru (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-6823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Renaud Delbru updated SOLR-6823:

Attachment: SOLR-6823.patch

 Improve extensibility of DistributedUpdateProcessor regarding version 
 processing
 

 Key: SOLR-6823
 URL: https://issues.apache.org/jira/browse/SOLR-6823
 Project: Solr
  Issue Type: Improvement
  Components: SolrCloud, update
Affects Versions: Trunk
Reporter: Renaud Delbru
 Attachments: SOLR-6823.patch


 As described in SOLR-6462, 
 {quote}
 doDeleteByQuery() is structured differently than processAdd() and 
 processDelete() in DistributedUpdateProcessor. We refactored 
 doDeleteByQuery() by extracting a portion of its code into a helper method 
 versionDeleteByQuery() which is then overridden in the CdcrUpdateProcessor. 
 This way doDeleteByQuery() is structurally similar to the other two cases and 
 we are able to keep the CDCR logic completely separated.
 {quote}
 This issue provides a patch for the DistributedUpdateProcessor for trunk.






[jira] [Updated] (SOLR-6273) Cross Data Center Replication

2014-12-05 Thread Renaud Delbru (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-6273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Renaud Delbru updated SOLR-6273:

Attachment: SOLR-6273.patch

The initial patch for CDCR for trunk. It contains a working version of 
cross data center replication for active-passive scenarios. The 
CdcrRequestHandler provides an API to control and monitor the replication. 
Documentation on how to configure CDCR and its API can be found 
[here|https://docs.google.com/document/d/1DZHUFM3z9OX171DeGjcLTRI9uULM-NB1KsCSpVL3Zy0/edit?usp=sharing].
This patch includes the following patches: SOLR-6621, SOLR-6819, SOLR-6823, and a few 
minor modifications to the UpdateLog and TransactionLog classes. Other than that, the 
rest of the CDCR code simply extends the Solr Core code.

 Cross Data Center Replication
 

 Key: SOLR-6273
 URL: https://issues.apache.org/jira/browse/SOLR-6273
 Project: Solr
  Issue Type: New Feature
Reporter: Yonik Seeley
 Attachments: SOLR-6273.patch


 This is the master issue for Cross Data Center Replication (CDCR)
 described at a high level here: 
 http://heliosearch.org/solr-cross-data-center-replication/






[jira] [Updated] (SOLR-6823) Improve extensibility of DistributedUpdateProcessor regarding version processing

2014-12-05 Thread Renaud Delbru (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-6823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Renaud Delbru updated SOLR-6823:

Description: 
As described in SOLR-6462, 
{quote}
doDeleteByQuery() is structured differently than processAdd() and 
processDelete() in DistributedUpdateProcessor. We refactored doDeleteByQuery() 
by extracting a portion of its code into a helper method versionDeleteByQuery() 
which is then overridden in the CdcrUpdateProcessor. This way doDeleteByQuery() 
is structurally similar to the other two cases and we are able to keep the CDCR 
logic completely separated.
{quote}

This issue provides a patch for the DistributedUpdateProcessor for trunk.

  was:
As described in 6462, 
{quote}
doDeleteByQuery() is structured differently than processAdd() and 
processDelete() in DistributedUpdateProcessor. We refactored doDeleteByQuery() 
by extracting a portion of its code into a helper method versionDeleteByQuery() 
which is then overriden in the CdcrUpdateProcessor. This way doDeleteByQuery() 
is structurally similar to the other two cases and we are able to keep the CDCR 
logic completely separated.
{quote}

This issue provides a patch for the DisitrbutedUpdateProcessor for trunk.


 Improve extensibility of DistributedUpdateProcessor regarding version 
 processing
 

 Key: SOLR-6823
 URL: https://issues.apache.org/jira/browse/SOLR-6823
 Project: Solr
  Issue Type: Improvement
  Components: SolrCloud, update
Affects Versions: Trunk
Reporter: Renaud Delbru
 Attachments: SOLR-6823.patch


 As described in SOLR-6462, 
 {quote}
 doDeleteByQuery() is structured differently than processAdd() and 
 processDelete() in DistributedUpdateProcessor. We refactored 
 doDeleteByQuery() by extracting a portion of its code into a helper method 
 versionDeleteByQuery() which is then overridden in the CdcrUpdateProcessor. 
 This way doDeleteByQuery() is structurally similar to the other two cases and 
 we are able to keep the CDCR logic completely separated.
 {quote}
 This issue provides a patch for the DistributedUpdateProcessor for trunk.






[jira] [Comment Edited] (SOLR-6273) Cross Data Center Replication

2014-12-05 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-6273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14235662#comment-14235662
 ] 

Renaud Delbru edited comment on SOLR-6273 at 12/5/14 3:58 PM:
--

The initial patch for CDCR for trunk. It contains a working version of 
cross data center replication for active-passive scenarios. The 
CdcrRequestHandler provides an API to control and monitor the replication. 
Documentation on how to configure CDCR and its API can be found 
[here|https://docs.google.com/document/d/1DZHUFM3z9OX171DeGjcLTRI9uULM-NB1KsCSpVL3Zy0/edit?usp=sharing].
This patch includes the following patches: SOLR-6621, SOLR-6819, SOLR-6823, and 
a few minor modifications to the UpdateLog and TransactionLog classes. Other 
than that, the rest of the CDCR code simply extends the Solr Core code.


was (Author: rendel):
The initial patch for cdcr for trunk. It contains a working version of the 
cross data center replication for active-passive scenarios. The 
CdcrRequestHandler provides an API to control and monitor the replication. A 
documentation on how to configure cdcr and of the API can be found 
[here|https://docs.google.com/document/d/1DZHUFM3z9OX171DeGjcLTRI9uULM-NB1KsCSpVL3Zy0/edit?usp=sharing].
This patch includes the following patches: 6621, 6819, 6823, and a few minor 
modifications on the UpdateLog and TransactionLog classes. Other than that, the 
rest of the CDCR code simply extends the Solr Core code.

 Cross Data Center Replication
 

 Key: SOLR-6273
 URL: https://issues.apache.org/jira/browse/SOLR-6273
 Project: Solr
  Issue Type: New Feature
Reporter: Yonik Seeley
 Attachments: SOLR-6273.patch


 This is the master issue for Cross Data Center Replication (CDCR)
 described at a high level here: 
 http://heliosearch.org/solr-cross-data-center-replication/






[jira] [Created] (SOLR-6819) Being able to configure the updates log implementation from solrconfig.xml

2014-12-04 Thread Renaud Delbru (JIRA)
Renaud Delbru created SOLR-6819:
---

 Summary: Being able to configure the updates log implementation 
from solrconfig.xml
 Key: SOLR-6819
 URL: https://issues.apache.org/jira/browse/SOLR-6819
 Project: Solr
  Issue Type: Improvement
  Components: SolrCloud, update
Affects Versions: Trunk
Reporter: Renaud Delbru


CDCR requires its own implementation of the updates log. At the moment, there 
is no way to configure the class to use when instantiating the updates log. The 
UpdateHandler is deciding to instantiate the base class UpdateLog or its hdfs 
version HdfsUpdateLog based on the directory path prefix.
We can extend the UpdateHandler to allow for a class parameter to be defined 
for the updateLog section of the solrconfig.xml. For example, the relevant part 
of the solrconfig.xml will look like:
{code:xml}
<updateHandler class="solr.DirectUpdateHandler2">
  <updateLog class="solr.CdcrUpdateLog">
    <str name="dir">${solr.ulog.dir:}</str>
  </updateLog>
</updateHandler>
{code}
where the updateLog entry has a class parameter indicating that the 
CdcrUpdateLog implementation must be used.






[jira] [Updated] (SOLR-6819) Being able to configure the updates log implementation from solrconfig.xml

2014-12-04 Thread Renaud Delbru (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-6819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Renaud Delbru updated SOLR-6819:

Attachment: SOLR-6819.patch

The patch implementing the extension to configure the class of the updates log. 
This also makes the update log initialisation part of the UpdateHandler 
cleaner. Specific instructions for the configuration of the HdfsUpdateLog have 
been moved within HdfsUpdateLog itself.

 Being able to configure the updates log implementation from solrconfig.xml
 --

 Key: SOLR-6819
 URL: https://issues.apache.org/jira/browse/SOLR-6819
 Project: Solr
  Issue Type: Improvement
  Components: SolrCloud, update
Affects Versions: Trunk
Reporter: Renaud Delbru
 Attachments: SOLR-6819.patch


 CDCR requires its own implementation of the updates log. At the moment, there 
 is no way to configure the class to use when instantiating the updates log. 
 The UpdateHandler is deciding to instantiate the base class UpdateLog or its 
 hdfs version HdfsUpdateLog based on the directory path prefix.
 We can extend the UpdateHandler to allow for a class parameter to be defined 
 for the updateLog section of the solrconfig.xml. For example, the relevant 
 part of the solrconfig.xml will look like:
 {code:xml}
 <updateHandler class="solr.DirectUpdateHandler2">
   <updateLog class="solr.CdcrUpdateLog">
     <str name="dir">${solr.ulog.dir:}</str>
   </updateLog>
 </updateHandler>
 {code}
 where the updateLog entry has a class parameter indicating that the 
 CdcrUpdateLog implementation must be used.






[jira] [Commented] (SOLR-6621) SolrZkClient does not guarantee that a watch object will only be triggered once for a given notification

2014-10-16 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-6621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14173785#comment-14173785
 ] 

Renaud Delbru commented on SOLR-6621:
-

I have added some comments, and created a pull request at:
https://github.com/apache/lucene-solr/pull/100

 SolrZkClient does not guarantee that a watch object will only be triggered 
 once for a given notification
 

 Key: SOLR-6621
 URL: https://issues.apache.org/jira/browse/SOLR-6621
 Project: Solr
  Issue Type: Bug
  Components: SolrCloud
Affects Versions: Trunk
Reporter: Renaud Delbru
 Attachments: SOLR-6621


 The SolrZkClient provides methods such as getData or exists. The problem is 
 that the client automatically wraps the provided watcher with a new watcher 
 (see 
 [here|https://github.com/apache/lucene-solr/blob/6ead83a6fafbdd6c444e2a837b09eccf34a255ef/solr/solrj/src/java/org/apache/solr/common/cloud/SolrZkClient.java#L255])
  which breaks the guarantee that a watch object, or function/context pair, 
 will only be triggered once for a given notification. This creates 
 undesirable effects when we are registering the same watch in the Watcher 
 callback method.
 A possible solution would be to introduce a SolrZkWatcher class, that will 
 take care of submitting the job to the zkCallbackExecutor. Components in 
 SolrCloud will extend this class and implement their own callback method. 
 This will ensure that the watcher object that zookeeper receives remains the 
 same.
 See SOLR-6462 for background information.






[jira] [Updated] (SOLR-6621) SolrZkClient does not guarantee that a watch object will only be triggered once for a given notification

2014-10-14 Thread Renaud Delbru (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-6621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Renaud Delbru updated SOLR-6621:

Attachment: SOLR-6621

Hi,

Were you thinking of something like this?

 SolrZkClient does not guarantee that a watch object will only be triggered 
 once for a given notification
 

 Key: SOLR-6621
 URL: https://issues.apache.org/jira/browse/SOLR-6621
 Project: Solr
  Issue Type: Bug
  Components: SolrCloud
Affects Versions: Trunk
Reporter: Renaud Delbru
 Attachments: SOLR-6621


 The SolrZkClient provides methods such as getData or exists. The problem is 
 that the client automatically wraps the provided watcher with a new watcher 
 (see 
 [here|https://github.com/apache/lucene-solr/blob/6ead83a6fafbdd6c444e2a837b09eccf34a255ef/solr/solrj/src/java/org/apache/solr/common/cloud/SolrZkClient.java#L255])
  which breaks the guarantee that a watch object, or function/context pair, 
 will only be triggered once for a given notification. This creates 
 undesirable effects when we are registering the same watch in the Watcher 
 callback method.
 A possible solution would be to introduce a SolrZkWatcher class, that will 
 take care of submitting the job to the zkCallbackExecutor. Components in 
 SolrCloud will extend this class and implement their own callback method. 
 This will ensure that the watcher object that zookeeper receives remains the 
 same.
 See SOLR-6462 for background information.






[jira] [Commented] (SOLR-6621) SolrZkClient does not guarantee that a watch object will only be triggered once for a given notification

2014-10-14 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-6621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170918#comment-14170918
 ] 

Renaud Delbru commented on SOLR-6621:
-

OK, if this looks good, I'll add some documentation on the wrapWatcher method and 
upload a new patch.

 SolrZkClient does not guarantee that a watch object will only be triggered 
 once for a given notification
 

 Key: SOLR-6621
 URL: https://issues.apache.org/jira/browse/SOLR-6621
 Project: Solr
  Issue Type: Bug
  Components: SolrCloud
Affects Versions: Trunk
Reporter: Renaud Delbru
 Attachments: SOLR-6621


 The SolrZkClient provides methods such as getData or exists. The problem is 
 that the client automatically wraps the provided watcher with a new watcher 
 (see 
 [here|https://github.com/apache/lucene-solr/blob/6ead83a6fafbdd6c444e2a837b09eccf34a255ef/solr/solrj/src/java/org/apache/solr/common/cloud/SolrZkClient.java#L255])
  which breaks the guarantee that a watch object, or function/context pair, 
 will only be triggered once for a given notification. This creates 
 undesirable effects when we are registering the same watch in the Watcher 
 callback method.
 A possible solution would be to introduce a SolrZkWatcher class, that will 
 take care of submitting the job to the zkCallbackExecutor. Components in 
 SolrCloud will extend this class and implement their own callback method. 
 This will ensure that the watcher object that zookeeper receives remains the 
 same.
 See SOLR-6462 for background information.






[jira] [Created] (SOLR-6621) SolrZkClient does not guarantee that a watch object will only be triggered once for a given notification

2014-10-13 Thread Renaud Delbru (JIRA)
Renaud Delbru created SOLR-6621:
---

 Summary: SolrZkClient does not guarantee that a watch object will 
only be triggered once for a given notification
 Key: SOLR-6621
 URL: https://issues.apache.org/jira/browse/SOLR-6621
 Project: Solr
  Issue Type: Bug
  Components: SolrCloud
Affects Versions: Trunk
Reporter: Renaud Delbru


The SolrZkClient provides methods such as getData or exists. The problem is 
that the client automatically wraps the provided watcher with a new watcher 
(see 
[here|https://github.com/apache/lucene-solr/blob/6ead83a6fafbdd6c444e2a837b09eccf34a255ef/solr/solrj/src/java/org/apache/solr/common/cloud/SolrZkClient.java#L255])
 which breaks the guarantee that a watch object, or function/context pair, 
will only be triggered once for a given notification. This creates undesirable 
effects when we are registering the same watch in the Watcher callback method.

A possible solution would be to introduce a SolrZkWatcher class that will take 
care of submitting the job to the zkCallbackExecutor. Components in SolrCloud 
will extend this class and implement their own callback method. This will 
ensure that the watcher object that zookeeper receives remains the same.

See SOLR-6462 for background information.
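
The proposed fix can be sketched as follows. This is a hypothetical, dependency-free illustration: the Watcher interface is simplified here (the real one is org.apache.zookeeper.Watcher), and the executor setup is an assumption, not the actual Solr code.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Simplified stand-in for org.apache.zookeeper.Watcher.
interface Watcher {
    void process(String event);
}

// Hypothetical sketch of the proposed SolrZkWatcher: the component itself is
// the Watcher object handed to ZooKeeper, so re-registering it never changes
// the object identity ZooKeeper uses, and only the *processing* of the event
// is handed off to the callback executor.
abstract class SolrZkWatcher implements Watcher {
    // Daemon thread so this sketch does not keep the JVM alive.
    private final ExecutorService zkCallbackExecutor =
        Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "zkCallback");
            t.setDaemon(true);
            return t;
        });

    @Override
    public final void process(String event) {
        zkCallbackExecutor.submit(() -> onEvent(event));
    }

    // Components implement their callback (and any re-registration) here.
    protected abstract void onEvent(String event);
}
```

Since ZooKeeper always sees the same object, the once-per-notification guarantee is preserved even when the component re-registers itself from inside its callback.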






[jira] [Commented] (SOLR-6462) forward updates asynchronously to peer clusters/leaders

2014-10-13 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-6462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14169296#comment-14169296
 ] 

Renaud Delbru commented on SOLR-6462:
-

I have started to implement the CDCR request handler that will handle CDCR 
life-cycle actions and forward updates to the peer clusters.
While trying to implement the synchronisation of the life-cycle status amongst 
all the nodes of a cluster by using zookeeper, I have encountered a limitation 
of the SolrZkClient. The SolrZkClient provides methods such as getData or 
exists. The problem is that the client automatically wraps the provided watcher 
with a new watcher (see 
[here|https://github.com/apache/lucene-solr/blob/6ead83a6fafbdd6c444e2a837b09eccf34a255ef/solr/solrj/src/java/org/apache/solr/common/cloud/SolrZkClient.java#L255])
 which breaks the guarantee that a watch object, or function/context pair, 
will only be triggered once for a given notification. This creates undesirable 
effects when we are registering the same watch in the Watcher callback method.

I have created issue SOLR-6621 to report the problem.

 forward updates asynchronously to peer clusters/leaders
 ---

 Key: SOLR-6462
 URL: https://issues.apache.org/jira/browse/SOLR-6462
 Project: Solr
  Issue Type: Sub-task
Reporter: Yonik Seeley

 http://heliosearch.org/solr-cross-data-center-replication/#UpdateFlow
 - An update will be received by the shard leader and versioned
 - The update will be sent from the leader to its replicas
 - Concurrently, the update will be sent (synchronously or asynchronously) to the 
 shard leader in other clusters
 - The shard leader in the other cluster will receive the already versioned update 
 (and not re-version it), and forward the update to its replicas






[jira] [Updated] (SOLR-6460) Keep transaction logs around longer

2014-10-02 Thread Renaud Delbru (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-6460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Renaud Delbru updated SOLR-6460:

Attachment: SOLR-6460.patch

Here is the latest patch which includes an optimisation to reduce the number of 
opened files and some code cleaning. To summarise, the current patch provides 
the following:

h4. Cleaning of Old Transaction Logs

The CdcrUpdateLog removes old tlogs based on pointers instead of a fixed size 
limit.

h4. Log Reader

The CdcrUpdateLog provides a log reader with scan and seek operations. A log 
reader is associated to a log pointer, and is taking care of the life-cycle of 
the pointer.

h4. Log Index

To improve the efficiency of the seek operation of the log reader, an index of 
transaction log files has been added. This index enables quick lookup of a 
tlog file based on a version number. It is implemented by adding a 
version number to the tlog filename and by leveraging the file system index. 
This solution was chosen as it was simpler and more robust than managing a 
separate disk-based index.
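
As a rough illustration (hypothetical code, assuming the tlog.${logId}.${startVersion} naming introduced by this patch), the lookup reduces to a floor search over the start versions parsed from the file names:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of the filename-based log index: because each file name
// carries its start version (tlog.<logId>.<startVersion>), finding the file
// that may contain a given version is a floor lookup over the parsed names,
// with no separate disk-based index to maintain.
class TlogIndex {
    // Returns the tlog file with the greatest startVersion <= version,
    // or null if every file starts after the requested version.
    static String lookup(List<String> tlogFileNames, long version) {
        TreeMap<Long, String> byStartVersion = new TreeMap<>();
        for (String name : tlogFileNames) {
            String[] parts = name.split("\\.");  // ["tlog", logId, startVersion]
            byStartVersion.put(Long.parseLong(parts[2]), name);
        }
        Map.Entry<Long, String> entry = byStartVersion.floorEntry(version);
        return entry == null ? null : entry.getValue();
    }
}
```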

h4. Number of Opened Files

TransactionLog has been extended to automatically (1) close the output stream 
when its reference count reaches 0, and (2) reopen the output stream on demand. 
The new tlog (the current tlog being written) is kept open at all times. When a 
transaction log is pushed to the old tlog list, its reference count is 
decremented, which might trigger the closing of the output stream. 
The output stream is reopened in two cases:
* during recovery, to write a commit to the end of an uncapped tlog file;
* when a log reader is accessing it.

At the moment, the logic is split into two classes (TransactionLog and 
CdcrTransactionLog). We should probably merge the two in the final version.
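
The reference-counting behaviour described above might look roughly like this. This is a hypothetical sketch, not the actual TransactionLog code; the class and method names are illustrative.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical sketch of the reference-counted output stream: decref() closes
// the underlying stream once the count reaches 0, and output() lazily reopens
// it in append mode when recovery or a log reader needs the file again.
class RefCountedLog {
    private final File file;
    private OutputStream out;
    private int refCount = 1;   // the live tlog starts referenced

    RefCountedLog(File file) throws IOException {
        this.file = file;
        this.out = new FileOutputStream(file, true);
    }

    synchronized void incref() {
        refCount++;
    }

    synchronized void decref() throws IOException {
        if (--refCount == 0 && out != null) {
            out.close();        // pushed to the old-tlog list: release the fd
            out = null;
        }
    }

    // Reopened on demand (recovery commit, or a log reader accessing it).
    synchronized OutputStream output() throws IOException {
        if (out == null) {
            out = new FileOutputStream(file, true);
            refCount++;         // the reopened stream holds a reference again
        }
        return out;
    }
}
```

This keeps the number of simultaneously open file descriptors bounded by the number of logs actually being written or read, rather than by the total number of retained tlog files.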

h4. Integration within the UpdateHandler

There is a nocommit in the UpdateHandler to force the instantiation of the 
CdcrUpdateLog instead of the UpdateLog. We need to decide how users will 
configure this and modify the UpdateHandler appropriately.


 Keep transaction logs around longer
 ---

 Key: SOLR-6460
 URL: https://issues.apache.org/jira/browse/SOLR-6460
 Project: Solr
  Issue Type: Sub-task
Reporter: Yonik Seeley
 Attachments: SOLR-6460.patch, SOLR-6460.patch, SOLR-6460.patch


 Transaction logs are currently deleted relatively quickly... but we need to 
 keep them around much longer to be used as a source for cross-datacenter 
 recovery.  This will also be useful in the future for enabling peer-sync to 
 use more historical updates before falling back to replication.






[jira] [Comment Edited] (SOLR-6460) Keep transaction logs around longer

2014-10-02 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-6460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156672#comment-14156672
 ] 

Renaud Delbru edited comment on SOLR-6460 at 10/2/14 3:47 PM:
--

Here is the latest patch which includes an optimisation to reduce the number of 
opened files and some code cleaning. To summarise, the current patch provides 
the following:

h4. Cleaning of Old Transaction Logs

The CdcrUpdateLog removes old tlogs based on pointers instead of a fixed size 
limit.

h4. Log Reader

The CdcrUpdateLog provides a log reader with scan and seek operations. A log 
reader is associated to a log pointer, and is taking care of the life-cycle of 
the pointer.

h4. Log Index

To improve the efficiency of the seek operation of the log reader, an index of 
transaction log files has been added. This index enables quick lookup of a 
tlog file based on a version number. It is implemented by adding a 
version number to the tlog filename and by leveraging the file system index. 
This solution was chosen as it was simpler and more robust than managing a 
separate disk-based index.

h4. Number of Opened Files

TransactionLog has been extended to automatically (1) close the output stream 
when its reference count reaches 0, and (2) reopen the output stream on 
demand. 
The new tlog (the current tlog being written) is kept open at all times. When a 
transaction log is pushed to the old tlog list, its reference count is 
decremented, which might trigger the closing of the output stream. 
The output stream is reopened in two cases:
* during recovery, to write a commit to the end of an uncapped tlog file;
* when a log reader is accessing it.

At the moment, the logic is split into two classes (TransactionLog and 
CdcrTransactionLog). We should probably merge the two in the final version.

h4. Integration within the UpdateHandler

There is a nocommit in the UpdateHandler to force the instantiation of the 
CdcrUpdateLog instead of the UpdateLog. We need to decide how users will 
configure this and modify the UpdateHandler appropriately.



was (Author: rendel):
Here is the latest patch which includes an optimisation to reduce the number of 
opened files and some code cleaning. To summarise, the current patch provides 
the following:

h4. Cleaning of Old Transaction Logs

The CdcrUpdateLog removes old tlogs based on pointers instead of a fixed size 
limit.

h4. Log Reader

The CdcrUpdateLog provides a log reader with scan and seek operations. A log 
reader is associated to a log pointer, and is taking care of the life-cycle of 
the pointer.

h4. Log Index

To improve the efficiency of the seek operation of the log reader, an index of 
transaction log files have been added. This index enables to quickly lookup a 
tlog file based on a version number. This index is implemented by adding a 
version number to the tlog filename and by leveraging the file system index. 
This solution was choosen as it was simpler and more robust than managing a 
separate disk-based index.

h4. Number of Opened Files

TransactionLog has been extended to automatically (1) close the output stream 
when its refeference count reach 0, and (2) reopen the output stream on demand. 
The new tlog (the current tlog being written) is kept open at all time. When a 
transaction log is pushed to the old tlog list, its reference count is 
decremented, which might trigger the closing of the output stream. 
The output stream is reopened in two cases:
* during recovery, to write a commit to the end of an uncapped tlog file;
* when a log reader is accessing it.

At the moment, the logic is splitted into two classes (TransactionLog and 
CdcrTransactionLog). We should probably merge the two in the final version.

h4. Integration within the UpdateHandler

There is a nocommit in the UpdateHandler to force the instantiation of the 
CdcrUpdateLog instead of the UpdateLog. We need to decide how user will 
configure this and modify the UpdateHandler appropriately.


 Keep transaction logs around longer
 ---

 Key: SOLR-6460
 URL: https://issues.apache.org/jira/browse/SOLR-6460
 Project: Solr
  Issue Type: Sub-task
Reporter: Yonik Seeley
 Attachments: SOLR-6460.patch, SOLR-6460.patch, SOLR-6460.patch


 Transaction logs are currently deleted relatively quickly... but we need to 
 keep them around much longer to be used as a source for cross-datacenter 
 recovery.  This will also be useful in the future for enabling peer-sync to 
 use more historical updates before falling back to replication.






[jira] [Updated] (SOLR-6460) Keep transaction logs around longer

2014-09-30 Thread Renaud Delbru (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-6460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Renaud Delbru updated SOLR-6460:

Attachment: SOLR-6460.patch

A new patch that introduces efficient seeking over a list of transaction log 
files. Efficient seeking is achieved by adding metadata (the version number) 
to the tlog filename and by leveraging the filesystem's index. The tlog 
filename now has the following syntax: tlog.${logId}.${startVersion}.
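Under this naming scheme, a reader can locate the single file that may contain a given version without opening any file; a minimal sketch (hypothetical class, not part of the patch):

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Key each tlog file on the startVersion embedded in its name; the file that
// may contain a version is the one with the greatest startVersion <= version.
class TlogIndex {
    private final NavigableMap<Long, String> byStartVersion = new TreeMap<>();

    void add(String fileName) {
        // Expected shape: tlog.<logId>.<startVersion>
        String[] parts = fileName.split("\\.");
        byStartVersion.put(Long.parseLong(parts[2]), fileName);
    }

    // Returns the only candidate file for `version`, or null if none.
    String fileFor(long version) {
        Map.Entry<Long, String> e = byStartVersion.floorEntry(version);
        return e == null ? null : e.getValue();
    }
}
```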






[jira] [Comment Edited] (SOLR-6460) Keep transaction logs around longer

2014-09-30 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-6460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14153430#comment-14153430
 ] 

Renaud Delbru edited comment on SOLR-6460 at 9/30/14 5:26 PM:
--

A new patch that introduces efficient seeking over a list of transaction log 
files. Efficient seeking is achieved by adding metadata (the version number) 
to the tlog filename and by leveraging the filesystem's index. The tlog 
filename now has the following syntax:
{noformat}
tlog.${logId}.${startVersion}
{noformat}




was (Author: rendel):
A new patch that introduces efficient seeking over a list of transaction log 
files. Efficient seeking is achieved by adding metadata (version number) to 
tlog filename and by leveraging the filesystem's index. The tlog filename has 
now the following syntax: tlog.${logId}.${startVersion}.







[jira] [Commented] (SOLR-6460) Keep transaction logs around longer

2014-09-24 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-6460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14146174#comment-14146174
 ] 

Renaud Delbru commented on SOLR-6460:
-

Hi, 

here is an initial analysis and proposal of the modifications to the UpdateLog 
for the CDCR scenario.
Most of the original workflow of the UpdateLog can be left untouched. However, 
it is necessary to keep the concept of a maximum number of records to keep 
(except for the cleaning of old transaction logs) so as not to interfere with 
the normal workflow.

h4. Cleaning of Old Transaction Logs

The logic to remove old tlog files should be modified so that it relies on 
pointers instead of a limit defined by the maximum number of records to keep.
The UpdateLog should be in charge of keeping the list of pointers and of 
managing their life cycle (or of delegating this to the LogReader, which is 
presented next). Such a pointer, denoted LogPointer, should be composed of a 
tlog file and an associated file pointer.
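A minimal sketch of the proposed LogPointer, an immutable (tlog file, file position) pair marking the next entry to read (names are hypothetical):

```java
// Hypothetical LogPointer: identifies a position within a transaction log.
final class LogPointer {
    final String tlogFile;  // which transaction log file
    final long position;    // byte offset of the next entry within that file

    LogPointer(String tlogFile, long position) {
        this.tlogFile = tlogFile;
        this.position = position;
    }

    // Advance within the same file, e.g. after reading one entry.
    LogPointer advanceTo(long newPosition) {
        return new LogPointer(tlogFile, newPosition);
    }

    @Override public String toString() { return tlogFile + "@" + position; }
}
```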

h4. Log Reader

The UpdateLog must provide a log reader, denoted LogReader, that will be used 
by the CDC Replicator to search, scan and read the update logs. The LogReader 
will wrap a LogPointer and hide its management (e.g., instantiation, increment, 
release).

The operations that must be provided by the LogReader are:
* Scan: move the LogPointer to the next entry
* Read: read the log entry pointed to by the LogPointer
* Lookup: look up a version number - this will be performed during the 
initialisation of the CDC Replicator / election of a new leader, and therefore 
rarely.

The LogReader must not only read old tlog files, but also the new tlog file 
(i.e., the transaction log being written). This requires specific logic, since 
a LogReader can be exhausted at time t1 and have new entries available at 
time t2.
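The shape of such a reader could look as follows; this is an illustrative sketch with a trivial in-memory implementation, not the proposed code. Note that next() returning null means "exhausted for now", not end-of-stream, since the current tlog may still grow:

```java
import java.util.List;

// Hypothetical LogReader API: scan/read collapsed into next(), lookup as seek().
interface LogReader {
    Long next();                // advance and return the next version, or null if exhausted for now
    boolean seek(long version); // position the reader on a given version number
}

// Trivial in-memory stand-in for illustration only.
class ListLogReader implements LogReader {
    private final List<Long> versions;
    private int pos = 0;

    ListLogReader(List<Long> versions) { this.versions = versions; }

    public Long next() {
        return pos < versions.size() ? versions.get(pos++) : null;
    }

    public boolean seek(long version) {
        int i = versions.indexOf(version);
        if (i < 0) return false;
        pos = i;
        return true;
    }
}
```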

h4. Log Index

In order to support efficient lookup of version numbers across a large number 
of tlog files, we need a pre-computed index of version numbers across tlog 
files.
The index could be designed as a list of tlog files, each associated with its 
lower and upper bounds in terms of version numbers. A search would then read 
this index to quickly find the tlog files containing a given version number, 
and then read those tlog files to find the associated entry.
However, a single tlog file can be large in certain scenarios. Therefore, we 
could add a secondary index per tlog file, containing a list of (version, 
pointer) pairs. This would allow the LogReader to quickly find an entry 
without having to scan the full tlog file. This index would be created and 
managed by the TransactionLog.
However, this secondary index duplicates the version number for each log 
entry. A possible optimisation is to modify the format of the transaction log 
so that the version number is not stored as part of the log entry.
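The two-level lookup could be sketched as follows (all names hypothetical): a primary index of per-file version bounds narrows the search to one file, and the secondary per-file index jumps close to the entry:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Illustrative two-level version lookup: file -> [min, max] version bounds,
// plus a per-file map from version to byte offset within that file.
class VersionLookup {
    private final Map<String, long[]> bounds = new HashMap<>();
    private final Map<String, NavigableMap<Long, Long>> perFile = new HashMap<>();

    void addEntry(String file, long version, long offset) {
        bounds.merge(file, new long[]{version, version},
            (a, b) -> new long[]{Math.min(a[0], b[0]), Math.max(a[1], b[1])});
        perFile.computeIfAbsent(file, f -> new TreeMap<>()).put(version, offset);
    }

    // Returns "file@offset" for the indexed entry at or before `version`,
    // or null if no tlog file covers that version.
    String locate(long version) {
        for (Map.Entry<String, long[]> e : bounds.entrySet()) {
            long[] b = e.getValue();
            if (version >= b[0] && version <= b[1]) {
                Map.Entry<Long, Long> hit = perFile.get(e.getKey()).floorEntry(version);
                if (hit != null) return e.getKey() + "@" + hit.getValue();
            }
        }
        return null;
    }
}
```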

h4. Transaction Log

The TransactionLog class opens the tlog file in its constructor. This could 
be problematic with a large number of tlog files, as it would exhaust the 
available file descriptors. One possible solution is to create a subclass for 
read-only mode that does not open the file in the constructor; instead, the 
file would be opened and closed on demand via the TransactionLog#LogReader. 
The CDCR Update Log would take care of converting old transaction log objects 
into this read-only version.
However, this has indirect consequences on the initialisation of the 
UpdateLog, more precisely in the recovery phase (#recoverFromLog), as the 
UpdateLog might write a commit (line 1418) at the end of an old tlog during 
replay.

h4. Integration within the UpdateHandler

We will have to extend the UpdateHandler constructor so that the UpdateLog 
implementation can be switched based on configuration keys in the 
solrconfig.xml file.





[jira] [Updated] (SOLR-6460) Keep transaction logs around longer

2014-09-24 Thread Renaud Delbru (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-6460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Renaud Delbru updated SOLR-6460:

Attachment: SOLR-6460.patch

Here is a first patch with an initial implementation of the CdcrUpdateLog, 
which includes:
* the cleaning of old logs based on log pointers;
* a log reader that reads both the old and new tlog files.
There are many nocommits and todos, but this might provide enough material for 
discussion.




[jira] [Commented] (SOLR-6463) track update progress

2014-09-18 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-6463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139076#comment-14139076
 ] 

Renaud Delbru commented on SOLR-6463:
-

Hi Yonik, All,

Here is a proposal for tracking update progress in CDCR. For now, we assume an 
active-passive scenario, where one source cluster forwards updates to one or 
more target clusters. I am looking forward to reading your feedback on this 
proposal.

h4. Updates Tracking & Pushing

CDCR replicates data updates from the source to the target Data Center by 
leveraging the Updates Log. A background thread regularly checks the Updates 
Log for new entries, and then forwards them to the target Data Center. The 
thread therefore needs to keep a checkpoint in the form of a pointer to the 
last update successfully processed in the Updates Log. Upon acknowledgement 
from the target Data Center that updates have been successfully processed, the 
Updates Log pointer is updated to reflect the current checkpoint.

This pointer must be synchronized across all the replicas. In the case where 
the leader goes down and a new leader is elected, the new leader will be able 
to resume replication to the last update by using this synchronized pointer. 
The strategy to synchronize such a pointer across replicas will be explained 
next.

If, for some reason, the target Data Center is offline or fails to process 
the updates, the thread will periodically try to contact it again and push 
the updates.
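The tracking loop above can be sketched as follows; all names are hypothetical, and batches stand in for real Updates Log entries. The key invariant is that the checkpoint advances only on acknowledgement, so an offline target simply leaves it in place for a later retry:

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Illustrative forwarder: push entries newer than the checkpoint and advance
// the checkpoint only once the target Data Center acknowledges them.
class Forwarder {
    private long checkpoint;                   // last version acked by the target
    private final Predicate<List<Long>> push;  // sends a batch; true on ack

    Forwarder(long start, Predicate<List<Long>> push) {
        this.checkpoint = start;
        this.push = push;
    }

    // One iteration of the background thread.
    void runOnce(List<Long> updatesLog) {
        List<Long> batch = updatesLog.stream()
            .filter(v -> v > checkpoint)
            .collect(Collectors.toList());
        if (!batch.isEmpty() && push.test(batch)) {
            checkpoint = batch.get(batch.size() - 1);  // advance on ack only
        }
    }

    long checkpoint() { return checkpoint; }
}
```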

h4. Synchronization of Update Checkpoints

A reliable synchronization of the update checkpoints between the shard leader 
and shard replicas is critical to avoid introducing inconsistency between the 
source and target Data Centers. Another important requirement is that the 
synchronization must be performed with minimal network traffic to maximize 
scalability.

In order to achieve this, the strategy is to:
* Uniquely identify each update operation. This unique identifier will serve as 
pointer. 
* Rely on two storages: an ephemeral storage on the source shard leader, and a 
persistent storage on the target cluster.

The shard leader in the source cluster will be in charge of generating a 
unique identifier for each update operation, and will keep an in-memory copy 
of the identifier of the last processed update. The identifier will be sent to 
the target cluster as part of the update request. On the target Data Center 
side, the shard leader will receive the update request, store it along with 
the unique identifier in the Updates Log, and replicate it to the other shards.

SolrCloud already provides a unique identifier for each update operation, 
i.e., a “version” number. This version number is generated using a time-based 
Lamport clock, which is incremented for each update operation sent. This 
provides a “happened-before” ordering of the update operations that will be 
leveraged in (1) the initialisation of the update checkpoint on the source 
cluster, and (2) the maintenance strategy of the Updates Log.

The persistent storage on the target cluster is used only during the election 
of a new shard leader on the source cluster. If a shard leader goes down on the 
source cluster and a new leader is elected, the new leader will contact the 
target cluster to retrieve the last update checkpoint and instantiate its 
ephemeral pointer. On such a request, the target cluster will retrieve the 
latest identifier received across all the shards, and send it back to the 
source cluster. To retrieve the latest identifier, every shard leader will 
look up the identifier of the first entry in its Updates Log and send it back 
to a coordinator. The coordinator will then select the highest among them.
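The coordinator's selection step reduces to taking the maximum over the per-shard reports; a one-line sketch with hypothetical names:

```java
import java.util.Collections;
import java.util.List;

// Illustrative: each target shard leader reports the version of the first
// entry in its Updates Log; the highest of these becomes the checkpoint
// returned to the source cluster.
class CheckpointCoordinator {
    static long selectCheckpoint(List<Long> firstVersionPerShard) {
        return Collections.max(firstVersionPerShard);
    }
}
```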

This strategy does not require any additional network traffic and ensures 
reliable pointer synchronization. Consistency is principally achieved by 
leveraging SolrCloud. The update workflow of SolrCloud ensures that every 
update is applied to the leader as well as to the replicas. If the leader 
goes down, a new leader is elected. During the leader election, a 
synchronisation is performed between the new leader and the other replicas. 
As a result, the new leader has an Updates Log consistent with the previous 
leader's. Having a consistent Updates Log means that:
* On the source cluster, the update checkpoint can be reused by the new leader.
* On the target cluster, the update checkpoint will be consistent between the 
previous and new leader. This ensures the correctness of the update checkpoint 
sent by a newly elected leader on the target cluster.

h6. Impact of Solr’s Update Reordering

The Updates Log can differ between the leader and the replicas, but not in an 
inconsistent way. During leader to replica synchronisation, Solr’s Distributed 
Update Processor will take care of reordering the update

[jira] [Created] (LUCENE-4919) IntsRef, BytesRef and CharsRef returns incorrect hashcode when filled with 0

2013-04-09 Thread Renaud Delbru (JIRA)
Renaud Delbru created LUCENE-4919:
-

 Summary: IntsRef, BytesRef and CharsRef returns incorrect hashcode 
when filled with 0
 Key: LUCENE-4919
 URL: https://issues.apache.org/jira/browse/LUCENE-4919
 Project: Lucene - Core
  Issue Type: Bug
  Components: core/other
Affects Versions: 4.2
Reporter: Renaud Delbru
 Fix For: 4.3


IntsRef, BytesRef and CharsRef implementation does not follow the java 
Arrays.hashCode implementation, and returns incorrect hashcode when filled with 
0. 
For example, an IntsRef with { 0 } will return the same hashcode than an 
IntsRef with { 0, 0 }.
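The collision can be reproduced with a simplified stand-in for the *Ref hash of the time (a 31-multiplier hash seeded with 0; the actual Lucene code differs in details), contrasted with Arrays.hashCode, which seeds with 1:

```java
// Simplified stand-in for the old *Ref hashCode: a multiplicative hash seeded
// with 0, so every all-zero array hashes to 0 regardless of its length.
class HashDemo {
    static int refStyleHash(int[] a) {
        int h = 0;                    // seed 0: zero elements contribute nothing
        for (int v : a) h = 31 * h + v;
        return h;
    }
}
```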

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-4919) IntsRef, BytesRef and CharsRef returns incorrect hashcode when filled with 0

2013-04-09 Thread Renaud Delbru (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Renaud Delbru updated LUCENE-4919:
--

Description: 
IntsRef, BytesRef and CharsRef implementation does not follow the java 
Arrays.hashCode implementation, and returns incorrect hashcode when filled with 
0. 
For example, an IntsRef with \{ 0 \} will return the same hashcode than an 
IntsRef with \{ 0, 0 \}.

  was:
IntsRef, BytesRef and CharsRef implementation does not follow the java 
Arrays.hashCode implementation, and returns incorrect hashcode when filled with 
0. 
For example, an IntsRef with { 0 } will return the same hashcode than an 
IntsRef with { 0, 0 }.





[jira] [Updated] (LUCENE-4919) IntsRef, BytesRef and CharsRef returns incorrect hashcode when filled with 0

2013-04-09 Thread Renaud Delbru (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Renaud Delbru updated LUCENE-4919:
--

Attachment: LUCENE-4919.patch

Here is a patch for IntsRef, BytesRef and CharsRef, including unit tests. The 
new hashcode implementation is identical to the one found in Arrays.hashCode.




[jira] [Updated] (LUCENE-4919) IntsRef, BytesRef and CharsRef return incorrect hashcode when filled with 0

2013-04-09 Thread Renaud Delbru (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Renaud Delbru updated LUCENE-4919:
--

Description: 
IntsRef, BytesRef and CharsRef implementation do not follow the java 
Arrays.hashCode implementation, and return incorrect hashcode when filled with 
0. 
For example, an IntsRef with \{ 0 \} will return the same hashcode than an 
IntsRef with \{ 0, 0 \}.

  was:
IntsRef, BytesRef and CharsRef implementation does not follow the java 
Arrays.hashCode implementation, and returns incorrect hashcode when filled with 
0. 
For example, an IntsRef with \{ 0 \} will return the same hashcode than an 
IntsRef with \{ 0, 0 \}.

Summary: IntsRef, BytesRef and CharsRef return incorrect hashcode when 
filled with 0  (was: IntsRef, BytesRef and CharsRef returns incorrect hashcode 
when filled with 0)




[jira] [Commented] (LUCENE-4919) IntsRef, BytesRef and CharsRef return incorrect hashcode when filled with 0

2013-04-09 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13626454#comment-13626454
 ] 

Renaud Delbru commented on LUCENE-4919:
---

Hi Robert,

From my understanding, this applies only to BytesRef (even if this behavior 
sounds dangerous to me). However, why do IntsRef and CharsRef follow the same 
behavior?




[jira] [Commented] (LUCENE-4919) IntsRef, BytesRef and CharsRef return incorrect hashcode when filled with 0

2013-04-09 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13626458#comment-13626458
 ] 

Renaud Delbru commented on LUCENE-4919:
---

I see that BytesRef is used in various contexts quite different from the 
TermsHash context. This hashcode behavior might cause unexpected problems, as 
I am sure most users of BytesRef are unaware of it.




[jira] [Commented] (LUCENE-4919) IntsRef, BytesRef and CharsRef return incorrect hashcode when filled with 0

2013-04-09 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13626471#comment-13626471
 ] 

Renaud Delbru commented on LUCENE-4919:
---

Ok, I understand, Robert. That sounds like a big task. I can try to make a 
first pass over it in the next few days if you think it is worth it 
(personally, I would feel more reassured knowing that the hashcode follows a 
more common behavior).




[jira] [Commented] (LUCENE-4919) IntsRef, BytesRef and CharsRef return incorrect hashcode when filled with 0

2013-04-09 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13626477#comment-13626477
 ] 

Renaud Delbru commented on LUCENE-4919:
---

@Simon: I discovered the issue when using IntsRef. During query processing, I 
am streaming arrays of integers using IntsRef. I was relying on the hashCode 
to compute a unique identifier for the content of a particular IntsRef until 
I started to see unexpected results in my unit tests. Then I saw that the same 
behaviour is found in the other *Ref classes.
I could live without it and bypass the problem by changing my implementation 
(and computing my own hash code myself). But I thought this behaviour is not 
very clear to the user and could potentially be dangerous, and therefore worth 
sharing with you.




[jira] [Commented] (LUCENE-4919) IntsRef, BytesRef and CharsRef return incorrect hashcode when filled with 0

2013-04-09 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13626480#comment-13626480
 ] 

Renaud Delbru commented on LUCENE-4919:
---

Maybe a simpler solution would be to clearly state this behavior in the 
javadoc of all these methods.




[jira] [Commented] (LUCENE-4919) IntsRef, BytesRef and CharsRef return incorrect hashcode when filled with 0

2013-04-09 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13626486#comment-13626486
 ] 

Renaud Delbru commented on LUCENE-4919:
---

I agree with you, Dawid, but this particular behaviour increases the chance 
of getting the same hash for certain types of input. Anyway, I think the 
general decision is not to change the hashCode behaviour ;o), and I am fine 
with that. Feel free to close the issue.
Thanks, and sorry for the distraction.




[jira] [Closed] (LUCENE-4919) IntsRef, BytesRef and CharsRef return incorrect hashcode when filled with 0

2013-04-09 Thread Renaud Delbru (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Renaud Delbru closed LUCENE-4919.
-

Resolution: Not A Problem




[jira] [Commented] (LUCENE-4642) Add create(AttributeFactory) to TokenizerFactory and subclasses with ctors taking AttributeFactory, and remove Tokenizer's and subclasses' ctors taking AttributeSource

2013-03-20 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13607809#comment-13607809
 ] 

Renaud Delbru commented on LUCENE-4642:
---

Thanks for committing this, Steve and Robert. That's great.

 Add create(AttributeFactory) to TokenizerFactory and subclasses with ctors 
 taking AttributeFactory, and remove Tokenizer's and subclasses' ctors taking 
 AttributeSource
 ---

 Key: LUCENE-4642
 URL: https://issues.apache.org/jira/browse/LUCENE-4642
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/analysis
Affects Versions: 4.1
Reporter: Renaud Delbru
Assignee: Steve Rowe
  Labels: analysis, attribute, tokenizer
 Fix For: 5.0, 4.3

 Attachments: LUCENE-4642.patch, LUCENE-4642.patch, LUCENE-4642.patch, 
 LUCENE-4642.patch, 
 LUCENE-4642-single-create-method-on-TokenizerFactory-subclasses.patch, 
 LUCENE-4642-single-create-method-on-TokenizerFactory-subclasses.patch, 
 TrieTokenizerFactory.java.patch


 All tokenizer implementations have a constructor that takes a given 
 AttributeSource as parameter (LUCENE-1826).  These should be removed.
 TokenizerFactory does not provide an API to create tokenizers with a given 
 AttributeFactory, but quite a few tokenizers have constructors that take an 
 AttributeFactory.  TokenizerFactory should add a create(AttributeFactory) 
 method, as should subclasses for tokenizers with AttributeFactory accepting 
 ctors.




[jira] [Commented] (LUCENE-4642) TokenizerFactory should provide a create method with a given AttributeSource

2013-03-11 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13598688#comment-13598688
 ] 

Renaud Delbru commented on LUCENE-4642:
---

Hi Steve, I imagine things have been busy these past days with the 4.2 release. 
Do you need help to finalise this patch? Thanks.

 TokenizerFactory should provide a create method with a given AttributeSource
 

 Key: LUCENE-4642
 URL: https://issues.apache.org/jira/browse/LUCENE-4642
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/analysis
Affects Versions: 4.1
Reporter: Renaud Delbru
Assignee: Steve Rowe
  Labels: analysis, attribute, tokenizer
 Fix For: 4.2, 5.0

 Attachments: LUCENE-4642.patch, LUCENE-4642.patch, LUCENE-4642.patch, 
 TrieTokenizerFactory.java.patch


 All tokenizer implementations have a constructor that takes a given 
 AttributeSource as parameter (LUCENE-1826). However, the TokenizerFactory 
 does not provide an API to create tokenizers with a given AttributeSource.
 Side note: There are still a lot of tokenizers that do not provide 
 constructors that take AttributeSource and AttributeFactory.




[jira] [Commented] (LUCENE-4642) TokenizerFactory should provide a create method with a given AttributeSource

2013-02-26 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13587269#comment-13587269
 ] 

Renaud Delbru commented on LUCENE-4642:
---

Hi, any updates on the patch? Thanks.

 TokenizerFactory should provide a create method with a given AttributeSource
 

 Key: LUCENE-4642
 URL: https://issues.apache.org/jira/browse/LUCENE-4642
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/analysis
Affects Versions: 4.1
Reporter: Renaud Delbru
Assignee: Steve Rowe
  Labels: analysis, attribute, tokenizer
 Fix For: 4.2, 5.0

 Attachments: LUCENE-4642.patch, LUCENE-4642.patch, LUCENE-4642.patch, 
 TrieTokenizerFactory.java.patch


 All tokenizer implementations have a constructor that takes a given 
 AttributeSource as parameter (LUCENE-1826). However, the TokenizerFactory 
 does not provide an API to create tokenizers with a given AttributeSource.
 Side note: There are still a lot of tokenizers that do not provide 
 constructors that take AttributeSource and AttributeFactory.




[jira] [Commented] (LUCENE-4642) TokenizerFactory should provide a create method with a given AttributeSource

2013-02-14 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13578331#comment-13578331
 ] 

Renaud Delbru commented on LUCENE-4642:
---

Hi, would this patch be considered for inclusion at some point? Thanks.

 TokenizerFactory should provide a create method with a given AttributeSource
 

 Key: LUCENE-4642
 URL: https://issues.apache.org/jira/browse/LUCENE-4642
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/analysis
Affects Versions: 4.1
Reporter: Renaud Delbru
Assignee: Steve Rowe
  Labels: analysis, attribute, tokenizer
 Fix For: 4.2, 5.0

 Attachments: LUCENE-4642.patch, LUCENE-4642.patch, LUCENE-4642.patch, 
 TrieTokenizerFactory.java.patch


 All tokenizer implementations have a constructor that takes a given 
 AttributeSource as parameter (LUCENE-1826). However, the TokenizerFactory 
 does not provide an API to create tokenizers with a given AttributeSource.
 Side note: There are still a lot of tokenizers that do not provide 
 constructors that take AttributeSource and AttributeFactory.




[jira] [Updated] (LUCENE-4642) TokenizerFactory should provide a create method with a given AttributeSource

2013-02-03 Thread Renaud Delbru (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Renaud Delbru updated LUCENE-4642:
--

Attachment: LUCENE-4642.patch

 TokenizerFactory should provide a create method with a given AttributeSource
 

 Key: LUCENE-4642
 URL: https://issues.apache.org/jira/browse/LUCENE-4642
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/analysis
Affects Versions: 4.1
Reporter: Renaud Delbru
Assignee: Steve Rowe
  Labels: analysis, attribute, tokenizer
 Fix For: 4.2, 5.0

 Attachments: LUCENE-4642.patch, LUCENE-4642.patch, LUCENE-4642.patch, 
 TrieTokenizerFactory.java.patch


 All tokenizer implementations have a constructor that takes a given 
 AttributeSource as parameter (LUCENE-1826). However, the TokenizerFactory 
 does not provide an API to create tokenizers with a given AttributeSource.
 Side note: There are still a lot of tokenizers that do not provide 
 constructors that take AttributeSource and AttributeFactory.




[jira] [Commented] (LUCENE-4642) TokenizerFactory should provide a create method with a given AttributeSource

2013-02-03 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13569883#comment-13569883
 ] 

Renaud Delbru commented on LUCENE-4642:
---

Hi,

I have submitted a patch which integrates:
- the patch from Uwe
- the removal of the Tokenizer(AttributeSource) constructor
- the addition of a TokenizerFactory.create(AttributeFactory) method
- some of the changes from Steve's previous patch (e.g., the 
TokenizerFactory.create method throws UnsupportedOperationException by default)

All test suites are passing.

 TokenizerFactory should provide a create method with a given AttributeSource
 

 Key: LUCENE-4642
 URL: https://issues.apache.org/jira/browse/LUCENE-4642
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/analysis
Affects Versions: 4.1
Reporter: Renaud Delbru
Assignee: Steve Rowe
  Labels: analysis, attribute, tokenizer
 Fix For: 4.2, 5.0

 Attachments: LUCENE-4642.patch, LUCENE-4642.patch, LUCENE-4642.patch, 
 TrieTokenizerFactory.java.patch


 All tokenizer implementations have a constructor that takes a given 
 AttributeSource as parameter (LUCENE-1826). However, the TokenizerFactory 
 does not provide an API to create tokenizers with a given AttributeSource.
 Side note: There are still a lot of tokenizers that do not provide 
 constructors that take AttributeSource and AttributeFactory.




[jira] [Commented] (LUCENE-4642) TokenizerFactory should provide a create method with a given AttributeSource

2013-01-28 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13564422#comment-13564422
 ] 

Renaud Delbru commented on LUCENE-4642:
---

Great, I think that AttributeFactory hack could work for us. Would you agree to 
add a TokenizerFactory.create(AttributeFactory) method? I could prepare a 
patch for that.

 TokenizerFactory should provide a create method with a given AttributeSource
 

 Key: LUCENE-4642
 URL: https://issues.apache.org/jira/browse/LUCENE-4642
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/analysis
Affects Versions: 4.1
Reporter: Renaud Delbru
Assignee: Steve Rowe
  Labels: analysis, attribute, tokenizer
 Fix For: 4.2, 5.0

 Attachments: LUCENE-4642.patch, LUCENE-4642.patch, 
 TrieTokenizerFactory.java.patch


 All tokenizer implementations have a constructor that takes a given 
 AttributeSource as parameter (LUCENE-1826). However, the TokenizerFactory 
 does not provide an API to create tokenizers with a given AttributeSource.
 Side note: There are still a lot of tokenizers that do not provide 
 constructors that take AttributeSource and AttributeFactory.




[jira] [Commented] (LUCENE-4642) TokenizerFactory should provide a create method with a given AttributeSource

2013-01-27 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13563784#comment-13563784
 ] 

Renaud Delbru commented on LUCENE-4642:
---

Hi Robert,

I understand your point of view. One possible alternative for simplifying the 
API would be to refactor the constructors taking AttributeSource/AttributeFactory 
into setters. After a quick look, this looks compatible with the existing 
tokenizers and tokenizer factories. 
Setting the AttributeSource/AttributeFactory of a tokenizer would be 
transparent (i.e., subclasses would not have to declare an explicit 
constructor), and specific extensions could still be implemented by subclasses 
(e.g., NumericTokenStream could override the setAttributeFactory method to wrap 
a given factory with NumericAttributeFactory).
For the tokenizer factories, we could then implement a create method with an 
AttributeSource/AttributeFactory parameter, which would call the abstract 
create method and then call setAttributeSource/setAttributeFactory on the newly 
created tokenizer.

What do you think? Did I miss something in my reasoning that could break 
something? 
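The setter-based refactoring proposed here can be sketched with simplified stand-ins for Lucene's classes (none of these signatures are Lucene's actual API; they only illustrate the idea of injecting the source after construction):

```java
// Sketch: a setter replaces the AttributeSource constructor, and the factory's
// concrete create(AttributeSource) builds the tokenizer, then injects the source.
public class SetterRefactorSketch {
    static class AttributeSource {}

    static abstract class Tokenizer {
        private AttributeSource attributes = new AttributeSource();
        // Subclasses could override this to wrap the incoming source,
        // the way a NumericTokenStream-like class might wrap a factory.
        void setAttributeSource(AttributeSource source) { this.attributes = source; }
        AttributeSource getAttributeSource() { return attributes; }
    }

    static abstract class TokenizerFactory {
        abstract Tokenizer create(); // the single abstract factory method
        final Tokenizer create(AttributeSource source) {
            Tokenizer t = create();          // subclass builds the tokenizer
            t.setAttributeSource(source);    // then the shared source is injected
            return t;
        }
    }

    public static void main(String[] args) {
        TokenizerFactory factory = new TokenizerFactory() {
            Tokenizer create() { return new Tokenizer() {}; }
        };
        AttributeSource shared = new AttributeSource();
        Tokenizer a = factory.create(shared);
        Tokenizer b = factory.create(shared);
        System.out.println(a.getAttributeSource() == b.getAttributeSource()); // prints "true"
    }
}
```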

 TokenizerFactory should provide a create method with a given AttributeSource
 

 Key: LUCENE-4642
 URL: https://issues.apache.org/jira/browse/LUCENE-4642
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/analysis
Affects Versions: 4.1
Reporter: Renaud Delbru
Assignee: Steve Rowe
  Labels: analysis, attribute, tokenizer
 Fix For: 4.2, 5.0

 Attachments: LUCENE-4642.patch, LUCENE-4642.patch


 All tokenizer implementations have a constructor that takes a given 
 AttributeSource as parameter (LUCENE-1826). However, the TokenizerFactory 
 does not provide an API to create tokenizers with a given AttributeSource.
 Side note: There are still a lot of tokenizers that do not provide 
 constructors that take AttributeSource and AttributeFactory.




[jira] [Commented] (LUCENE-4642) TokenizerFactory should provide a create method with a given AttributeSource

2013-01-25 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13562740#comment-13562740
 ] 

Renaud Delbru commented on LUCENE-4642:
---

Hi, 

are there still open questions on this issue that block the patch from being 
committed? 

 TokenizerFactory should provide a create method with a given AttributeSource
 

 Key: LUCENE-4642
 URL: https://issues.apache.org/jira/browse/LUCENE-4642
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/analysis
Affects Versions: 4.1
Reporter: Renaud Delbru
Assignee: Steve Rowe
  Labels: analysis, attribute, tokenizer
 Fix For: 4.2, 5.0

 Attachments: LUCENE-4642.patch, LUCENE-4642.patch


 All tokenizer implementations have a constructor that takes a given 
 AttributeSource as parameter (LUCENE-1826). However, the TokenizerFactory 
 does not provide an API to create tokenizers with a given AttributeSource.
 Side note: There are still a lot of tokenizers that do not provide 
 constructors that take AttributeSource and AttributeFactory.




[jira] [Commented] (LUCENE-4642) TokenizerFactory should provide a create method with a given AttributeSource

2013-01-25 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13562853#comment-13562853
 ] 

Renaud Delbru commented on LUCENE-4642:
---

@steve:

{quote}
have you looked at TeeSinkTokenFilter
{quote}

Yes, and from my current understanding it is similar to our current 
implementation. The problem with this approach is that the exchange of 
attributes is performed through the AttributeSource.State API with 
AttributeSource#captureState and AttributeSource#restoreState, which copies the 
values of all attribute implementations that the state contains. This is very 
inefficient, as it has to copy arrays and other objects (e.g., char term 
arrays) for every single token.

@robert:

Concerning the problem of UOEs, Steve's new patch reduces the number of UOEs to 
one, which is much more reasonable than my first approach. Looking at the 
current state of the Lucene trunk, there are already a lot of UOEs in many 
places, so I would suggest that this may not be a blocking problem (but I might 
be wrong).

Concerning the problem of constructor explosion, maybe we can find a consensus. 
Your proposal of removing Tokenizer(AttributeSource) cannot work for us, as we 
need it to share the same AttributeSource across multiple streams. However, as 
I proposed, removing Tokenizer(AttributeFactory) could work, as it can be 
emulated using Tokenizer(AttributeSource).
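A toy model of the per-token overhead being discussed, using an invented CharTermAttribute stand-in rather than Lucene's real class: a captureState/restoreState-style exchange deep-copies the term buffer for every token, while a shared attribute source lets both streams read the same object with no copy at all:

```java
import java.util.Arrays;

// Illustrative only: models the cost difference between copying attribute
// state per token and sharing a single attribute instance between streams.
public class StateCopyDemo {
    static class CharTermAttribute {
        char[] buffer = new char[16];
        int length;
        // State-capture-style exchange: allocates and copies the buffer.
        CharTermAttribute copy() {
            CharTermAttribute c = new CharTermAttribute();
            c.buffer = Arrays.copyOf(buffer, buffer.length);
            c.length = length;
            return c;
        }
    }

    public static void main(String[] args) {
        CharTermAttribute parent = new CharTermAttribute();
        parent.buffer[0] = 'a';
        parent.length = 1;

        // Copy per token: the child gets its own array (allocation each time).
        CharTermAttribute copied = parent.copy();
        System.out.println(copied.buffer != parent.buffer); // prints "true"

        // Shared source: the child reads the very same attribute instance.
        CharTermAttribute shared = parent;
        System.out.println(shared.buffer == parent.buffer); // prints "true"
    }
}
```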



 TokenizerFactory should provide a create method with a given AttributeSource
 

 Key: LUCENE-4642
 URL: https://issues.apache.org/jira/browse/LUCENE-4642
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/analysis
Affects Versions: 4.1
Reporter: Renaud Delbru
Assignee: Steve Rowe
  Labels: analysis, attribute, tokenizer
 Fix For: 4.2, 5.0

 Attachments: LUCENE-4642.patch, LUCENE-4642.patch


 All tokenizer implementations have a constructor that takes a given 
 AttributeSource as parameter (LUCENE-1826). However, the TokenizerFactory 
 does not provide an API to create tokenizers with a given AttributeSource.
 Side note: There are still a lot of tokenizers that do not provide 
 constructors that take AttributeSource and AttributeFactory.




[jira] [Commented] (LUCENE-4642) TokenizerFactory should provide a create method with a given AttributeSource

2013-01-21 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13558832#comment-13558832
 ] 

Renaud Delbru commented on LUCENE-4642:
---

{quote}
Personally: I think we should remove Tokenizer(AttributeSource): it bloats the 
APIs and causes ctor explosion.
{quote}

Why not the contrary instead, i.e., remove Tokenizer(AttributeFactory) and keep 
Tokenizer(AttributeSource), since AttributeFactory is a nested class of 
AttributeSource? Limiting the API to AttributeFactory only would restrict it 
unnecessarily, imho.

Our use case is to build advanced token streams, where one parent token 
stream can have multiple child token streams; the parent token stream shares 
its attribute source with the child token streams for performance reasons. 
Emulating this behaviour by copying attributes from stream to stream is really 
inefficient (our throughput is divided by at least 3).
A more concrete use case is the ability to create specific token streams for 
a particular token type. For example, our parent tokenizer tokenizes a string 
into a list of tokens, each one having a specific type. Each token is then 
processed downstream by child token streams, and the child token stream that 
processes a token depends on its token type attribute.
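The type-based routing described here can be sketched as follows; all names are invented for illustration and are not part of any Lucene API:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: a parent stream tags each token with a type, and a per-type table
// routes the token to the child stream responsible for that type.
public class TypeRoutingSketch {
    interface ChildStream { String process(String token); }

    public static void main(String[] args) {
        Map<String, ChildStream> childrenByType = new HashMap<>();
        childrenByType.put("NUM", token -> "num:" + token);
        childrenByType.put("WORD", token -> "word:" + token.toLowerCase());

        // Parent tokenizer output: (token, type) pairs.
        String[][] tokens = { { "42", "NUM" }, { "Lucene", "WORD" } };
        for (String[] t : tokens) {
            // The token type attribute selects the child stream.
            System.out.println(childrenByType.get(t[1]).process(t[0]));
        }
    }
}
```

In the real setup the parent and children would also share one attribute source, so routing a token involves no attribute copying.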

 TokenizerFactory should provide a create method with a given AttributeSource
 

 Key: LUCENE-4642
 URL: https://issues.apache.org/jira/browse/LUCENE-4642
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/analysis
Affects Versions: 4.1
Reporter: Renaud Delbru
Assignee: Steve Rowe
  Labels: analysis, attribute, tokenizer
 Fix For: 4.2, 5.0

 Attachments: LUCENE-4642.patch, LUCENE-4642.patch


 All tokenizer implementations have a constructor that takes a given 
 AttributeSource as parameter (LUCENE-1826). However, the TokenizerFactory 
 does not provide an API to create tokenizers with a given AttributeSource.
 Side note: There are still a lot of tokenizers that do not provide 
 constructors that take AttributeSource and AttributeFactory.




[jira] [Commented] (LUCENE-4642) TokenizerFactory should provide a create method with a given AttributeSource

2013-01-21 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13558850#comment-13558850
 ] 

Renaud Delbru commented on LUCENE-4642:
---

{quote}
Because its totally unrelated.
{quote}

Well, I think the user could simply create a new AttributeSource with a given 
AttributeFactory to emulate Tokenizer(AttributeFactory)? But that might 
put some burden on the user side.
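That emulation can be sketched with simplified stand-in classes (not Lucene's real API; the point is only that wrapping the factory in a source makes a factory-taking constructor redundant):

```java
// Sketch: instead of a Tokenizer(AttributeFactory) constructor, the caller
// wraps the factory in a new AttributeSource and passes that.
public class FactoryEmulationSketch {
    static class AttributeFactory {}

    static class AttributeSource {
        final AttributeFactory factory;
        AttributeSource(AttributeFactory factory) { this.factory = factory; }
    }

    static class Tokenizer {
        final AttributeSource attributes;
        Tokenizer(AttributeSource attributes) { this.attributes = attributes; }
    }

    public static void main(String[] args) {
        AttributeFactory myFactory = new AttributeFactory();
        // Emulate Tokenizer(AttributeFactory) with Tokenizer(AttributeSource):
        Tokenizer t = new Tokenizer(new AttributeSource(myFactory));
        System.out.println(t.attributes.factory == myFactory); // prints "true"
    }
}
```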

 TokenizerFactory should provide a create method with a given AttributeSource
 

 Key: LUCENE-4642
 URL: https://issues.apache.org/jira/browse/LUCENE-4642
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/analysis
Affects Versions: 4.1
Reporter: Renaud Delbru
Assignee: Steve Rowe
  Labels: analysis, attribute, tokenizer
 Fix For: 4.2, 5.0

 Attachments: LUCENE-4642.patch, LUCENE-4642.patch


 All tokenizer implementations have a constructor that takes a given 
 AttributeSource as parameter (LUCENE-1826). However, the TokenizerFactory 
 does not provide an API to create tokenizers with a given AttributeSource.
 Side note: There are still a lot of tokenizers that do not provide 
 constructors that take AttributeSource and AttributeFactory.




[jira] [Commented] (LUCENE-4642) TokenizerFactory should provide a create method with a given AttributeSource

2013-01-16 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13555146#comment-13555146
 ] 

Renaud Delbru commented on LUCENE-4642:
---

Could someone from the team tell us whether this patch may be considered for 
inclusion at some point? We currently need it in our project, so it is 
blocking our development. Thanks.

 TokenizerFactory should provide a create method with a given AttributeSource
 

 Key: LUCENE-4642
 URL: https://issues.apache.org/jira/browse/LUCENE-4642
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/analysis
Affects Versions: 4.1
Reporter: Renaud Delbru
  Labels: analysis, attribute, tokenizer
 Fix For: 4.2, 5.0

 Attachments: LUCENE-4642.patch


 All tokenizer implementations have a constructor that takes a given 
 AttributeSource as parameter (LUCENE-1826). However, the TokenizerFactory 
 does not provide an API to create tokenizers with a given AttributeSource.
 Side note: There are still a lot of tokenizers that do not provide 
 constructors that take AttributeSource and AttributeFactory.




[jira] [Commented] (LUCENE-4642) TokenizerFactory should provide a create method with a given AttributeSource

2013-01-02 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13542144#comment-13542144
 ] 

Renaud Delbru commented on LUCENE-4642:
---

Hi,

Any plans to commit this patch? Or is there additional work to do first?

Thanks

 TokenizerFactory should provide a create method with a given AttributeSource
 

 Key: LUCENE-4642
 URL: https://issues.apache.org/jira/browse/LUCENE-4642
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/analysis
Affects Versions: 4.1
Reporter: Renaud Delbru
  Labels: analysis, attribute, tokenizer
 Fix For: 4.2, 5.0

 Attachments: LUCENE-4642.patch


 All tokenizer implementations have a constructor that takes a given 
 AttributeSource as parameter (LUCENE-1826). However, the TokenizerFactory 
 does not provide an API to create tokenizers with a given AttributeSource.
 Side note: There are still a lot of tokenizers that do not provide 
 constructors that take AttributeSource and AttributeFactory.




[jira] [Created] (LUCENE-4642) TokenizerFactory should provide a create method with a given AttributeSource

2012-12-20 Thread Renaud Delbru (JIRA)
Renaud Delbru created LUCENE-4642:
-

 Summary: TokenizerFactory should provide a create method with a 
given AttributeSource
 Key: LUCENE-4642
 URL: https://issues.apache.org/jira/browse/LUCENE-4642
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/analysis
Affects Versions: 4.1
Reporter: Renaud Delbru
 Fix For: 4.1


All tokenizer implementations have a constructor that takes a given 
AttributeSource as parameter (LUCENE-1826). However, the TokenizerFactory does 
not provide an API to create tokenizers with a given AttributeSource.

Side note: There are still a lot of tokenizers that do not provide constructors 
that take AttributeSource and AttributeFactory.




[jira] [Updated] (LUCENE-4642) TokenizerFactory should provide a create method with a given AttributeSource

2012-12-20 Thread Renaud Delbru (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Renaud Delbru updated LUCENE-4642:
--

Attachment: LUCENE-4642.patch

Patch adding #create(AttributeSource source, Reader reader) to the 
TokenizerFactory class and to all its subclasses.

Given that a lot of tokenizers do not have constructors that take a given 
AttributeSource, I have implemented the new create method in their 
respective factories to throw an UnsupportedOperationException.
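The default-UOE approach described in this patch can be sketched with simplified stand-ins (not the actual Lucene classes; only the shape of the API is illustrated):

```java
import java.io.Reader;
import java.io.StringReader;

// Sketch: TokenizerFactory gains a create(AttributeSource, Reader) variant,
// and factories whose tokenizer has no AttributeSource constructor throw.
public class UoeDefaultSketch {
    static class AttributeSource {}
    static class Tokenizer {}

    static abstract class TokenizerFactory {
        abstract Tokenizer create(Reader reader);
        // Default for factories that cannot honor a caller-supplied source.
        Tokenizer create(AttributeSource source, Reader reader) {
            throw new UnsupportedOperationException(
                getClass().getSimpleName() + " cannot use a given AttributeSource");
        }
    }

    public static void main(String[] args) {
        TokenizerFactory legacy = new TokenizerFactory() {
            Tokenizer create(Reader reader) { return new Tokenizer(); }
        };
        try {
            legacy.create(new AttributeSource(), new StringReader("text"));
        } catch (UnsupportedOperationException e) {
            System.out.println("unsupported"); // prints "unsupported"
        }
    }
}
```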

 TokenizerFactory should provide a create method with a given AttributeSource
 

 Key: LUCENE-4642
 URL: https://issues.apache.org/jira/browse/LUCENE-4642
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/analysis
Affects Versions: 4.1
Reporter: Renaud Delbru
  Labels: analysis, attribute, tokenizer
 Fix For: 4.1

 Attachments: LUCENE-4642.patch


 All tokenizer implementations have a constructor that takes a given 
 AttributeSource as parameter (LUCENE-1826). However, the TokenizerFactory 
 does not provide an API to create tokenizers with a given AttributeSource.
 Side note: There are still a lot of tokenizers that do not provide 
 constructors that take AttributeSource and AttributeFactory.




[jira] [Updated] (LUCENE-4613) CompressingStoredFieldsWriter ignores the segment suffix if writing aborted

2012-12-12 Thread Renaud Delbru (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Renaud Delbru updated LUCENE-4613:
--

Attachment: LUCENE-4613.patch

A first refactoring that tries to keep 
{{CompressingCodec#randomInstance(Random random)}} backward compatible. Let me 
know if this is good enough. Tests are passing, as is the specific 
TestIndexFileDeleter test case you previously reported.

 CompressingStoredFieldsWriter ignores the segment suffix if writing aborted
 ---

 Key: LUCENE-4613
 URL: https://issues.apache.org/jira/browse/LUCENE-4613
 Project: Lucene - Core
  Issue Type: Bug
  Components: core/codecs
Affects Versions: 4.1
Reporter: Renaud Delbru
 Fix For: 4.1

 Attachments: LUCENE-4613.patch, LUCENE-4613.patch, LUCENE-4613.patch


 If the writing is aborted, CompressingStoredFieldsWriter does not remove 
 partially-written files as the segment suffix is not taken into consideration.




[jira] [Commented] (LUCENE-4613) CompressingStoredFieldsWriter ignores the segment suffix if writing aborted

2012-12-11 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13528847#comment-13528847
 ] 

Renaud Delbru commented on LUCENE-4613:
---

Ok, I'll upload something today.

 CompressingStoredFieldsWriter ignores the segment suffix if writing aborted
 ---

 Key: LUCENE-4613
 URL: https://issues.apache.org/jira/browse/LUCENE-4613
 Project: Lucene - Core
  Issue Type: Bug
  Components: core/codecs
Affects Versions: 4.1
Reporter: Renaud Delbru
 Fix For: 4.1

 Attachments: LUCENE-4613.patch


 If the writing is aborted, CompressingStoredFieldsWriter does not remove 
 partially-written files as the segment suffix is not taken into consideration.




[jira] [Updated] (LUCENE-4613) CompressingStoredFieldsWriter ignores the segment suffix if writing aborted

2012-12-11 Thread Renaud Delbru (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Renaud Delbru updated LUCENE-4613:
--

Attachment: LUCENE-4613.patch

New patch with a unit test that checks that partially written files are removed 
if writing is aborted. 
I had to modify the API of CompressingStoredFieldsFormat a bit to make the test 
possible. Also, CompressingCodec now always adds a segment suffix. We 
might be able to improve this by randomly adding a segment suffix or not.

 CompressingStoredFieldsWriter ignores the segment suffix if writing aborted
 ---

 Key: LUCENE-4613
 URL: https://issues.apache.org/jira/browse/LUCENE-4613
 Project: Lucene - Core
  Issue Type: Bug
  Components: core/codecs
Affects Versions: 4.1
Reporter: Renaud Delbru
 Fix For: 4.1

 Attachments: LUCENE-4613.patch, LUCENE-4613.patch


 If the writing is aborted, CompressingStoredFieldsWriter does not remove 
 partially-written files as the segment suffix is not taken into consideration.



