[jira] [Comment Edited] (SLING-3967) Define replication strategy for big trees

2018-01-19 Thread Dirk Rudolph (JIRA)

[ 
https://issues.apache.org/jira/browse/SLING-3967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16332072#comment-16332072
 ] 

Dirk Rudolph edited comment on SLING-3967 at 1/19/18 10:41 AM:
---

Another thing that matters here is the guarantee of processing. With the
introduction of support for systems other than Sling, and for arbitrary
customisation of how DistributionRequests are serialised, the time and
resources needed to export a "package", in whatever format, are undefined. So
the bigger the DistributionRequests are, the more likely it becomes that
package creation itself fails, which, depending on the trigger used, might
cause the loss of the entire request.

For example, using SCD to index binary documents in Solr might require parsing
them with Tika and sending only the plain text as the result. Depending on the
documents to distribute, it might make sense to split the DistributionRequest
so that package creation time stays close to an approximate mean.

This can probably be done at the DistributionPackageExporter level (not taking
deep paths into account) or at the DistributionPackageBuilder level, which
would require an API change.
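
A minimal sketch of the exporter-level variant, assuming the split is done purely on the list of requested paths and ignores whatever lies below each path. `PathBatcher`, `split` and `maxPathsPerBatch` are hypothetical names for illustration only and are not part of the Sling Content Distribution API; the sketch works on plain path strings rather than on DistributionRequest objects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, not part of the Sling Content Distribution API. */
final class PathBatcher {

    private PathBatcher() {
    }

    /**
     * Splits the requested paths into consecutive batches of at most
     * maxPathsPerBatch entries. Deep paths are treated like any other path:
     * the subtree below a path is not inspected.
     */
    static List<List<String>> split(List<String> paths, int maxPathsPerBatch) {
        if (maxPathsPerBatch < 1) {
            throw new IllegalArgumentException("maxPathsPerBatch must be >= 1");
        }
        List<List<String>> batches = new ArrayList<>();
        for (int start = 0; start < paths.size(); start += maxPathsPerBatch) {
            int end = Math.min(start + maxPathsPerBatch, paths.size());
            // Copy the sub-list so each batch is independent of the input list.
            batches.add(new ArrayList<>(paths.subList(start, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> paths = Arrays.asList("/content/a", "/content/b", "/content/c");
        // prints [[/content/a, /content/b], [/content/c]]
        System.out.println(split(paths, 2));
    }
}
```

Each batch could then be exported as its own package, so a failure would affect one batch rather than the whole request. Because a single deep path can still hide an arbitrarily large subtree, this only bounds the number of paths per package, not its size.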


was (Author: diru):
Another thing that matters here is the guarantee of processing. With the
introduction of support for systems other than Sling, and for arbitrary
customisation of how DistributionRequests are serialised, the time and
resources needed to export a "package", in whatever format, are undefined. So
the bigger the DistributionRequests are, the more likely it becomes that
package creation itself fails, which, depending on the trigger used, might
cause the loss of the entire request.

For example, using SCD to index binary documents in Solr might require parsing
them with Tika and sending only the plain text as the result. Depending on the
documents to distribute, it might make sense to split the DistributionRequest
so that package creation time stays close to an approximate mean.

This looks like a change that is necessary only in the
LocalDistributionPackageExporter, if deep paths are not taken into account.

> Define replication strategy for big trees
> -
>
> Key: SLING-3967
> URL: https://issues.apache.org/jira/browse/SLING-3967
> Project: Sling
> Issue Type: Improvement
> Components: Content Distribution
> Reporter: Marius Petria
> Priority: Major
>
> An extreme case for replication is the replication of an entire big tree (for 
> example a /content/bigtree/* with GBs of content).
> One should be able to define a way to replicate this in smaller packages, so
> that it does not create packages big enough to affect performance.
> Options to do the split:
> - number of nodes (every 100 nodes)
> - size of data (every 100 MB)
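
The two split options listed in the issue can be sketched as a greedy batcher that closes a batch as soon as either limit would be exceeded. This is only an illustration under assumptions: `SizeBatcher`, `maxNodes` and `maxBytes` are hypothetical names, and the per-node sizes are assumed to be known up front, which a real implementation would have to measure or estimate.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical helper, not part of the Sling Content Distribution API. */
final class SizeBatcher {

    private SizeBatcher() {
    }

    /**
     * Groups node paths, in the map's iteration order, into batches that hold
     * at most maxNodes nodes and at most maxBytes of data. A single node that
     * is larger than maxBytes still gets a batch of its own.
     */
    static List<List<String>> split(Map<String, Long> sizeByPath, int maxNodes, long maxBytes) {
        if (maxNodes < 1 || maxBytes < 1) {
            throw new IllegalArgumentException("limits must be >= 1");
        }
        List<List<String>> batches = new ArrayList<>();
        List<String> current = new ArrayList<>();
        long currentBytes = 0;
        for (Map.Entry<String, Long> entry : sizeByPath.entrySet()) {
            long size = entry.getValue();
            boolean wouldOverflow = current.size() >= maxNodes
                    || currentBytes + size > maxBytes;
            // Close the running batch before adding a node that does not fit.
            if (!current.isEmpty() && wouldOverflow) {
                batches.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(entry.getKey());
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            batches.add(current);
        }
        return batches;
    }
}
```

With the values from the issue, a caller would pass `maxNodes = 100` and `maxBytes = 100L * 1024 * 1024`, together with an ordered map such as a `LinkedHashMap` so that the batches follow traversal order.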



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (SLING-3967) Define replication strategy for big trees

2018-01-19 Thread Dirk Rudolph (JIRA)

[ 
https://issues.apache.org/jira/browse/SLING-3967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16332072#comment-16332072
 ] 

Dirk Rudolph edited comment on SLING-3967 at 1/19/18 10:25 AM:
---

Another thing that matters here is the guarantee of processing. With the
introduction of support for systems other than Sling, and for arbitrary
customisation of how DistributionRequests are serialised, the time and
resources needed to export a "package", in whatever format, are undefined. So
the bigger the DistributionRequests are, the more likely it becomes that
package creation itself fails, which, depending on the trigger used, might
cause the loss of the entire request.

For example, using SCD to index binary documents in Solr might require parsing
them with Tika and sending only the plain text as the result. Depending on the
documents to distribute, it might make sense to split the DistributionRequest
so that package creation time stays close to an approximate mean.

This looks like a change that is necessary only in the
LocalDistributionPackageExporter, if deep paths are not taken into account.


was (Author: diru):
Another thing that matters here is the guarantee of processing. With the
introduction of support for systems other than Sling, and for arbitrary
customisation of how DistributionRequests are serialised, the time and
resources needed to export a "package", in whatever format, are undefined. So
the bigger the DistributionRequests are, the more likely it becomes that
package creation itself fails, which, depending on the trigger used, might
cause the loss of the entire request.

For example, using SCD to index binary documents in Solr might require parsing
them with Tika and sending only the plain text as the result. Depending on the
documents to distribute, it might make sense to split the DistributionRequest
so that package creation time stays close to an approximate mean.

This looks like a change that is necessary only in the
LocalDistributionPackageExporter.



[jira] [Comment Edited] (SLING-3967) Define replication strategy for big trees

2018-01-19 Thread Dirk Rudolph (JIRA)

[ 
https://issues.apache.org/jira/browse/SLING-3967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16332072#comment-16332072
 ] 

Dirk Rudolph edited comment on SLING-3967 at 1/19/18 10:20 AM:
---

Another thing that matters here is the guarantee of processing. With the
introduction of support for systems other than Sling, and for arbitrary
customisation of how DistributionRequests are serialised, the time and
resources needed to export a "package", in whatever format, are undefined. So
the bigger the DistributionRequests are, the more likely it becomes that
package creation itself fails, which, depending on the trigger used, might
cause the loss of the entire request.

For example, using SCD to index binary documents in Solr might require parsing
them with Tika and sending only the plain text as the result. Depending on the
documents to distribute, it might make sense to split the DistributionRequest
so that package creation time stays close to an approximate mean.

This looks like a change that is necessary only in the
LocalDistributionPackageExporter.


was (Author: diru):
Another thing that matters here is the guarantee of processing. With the
introduction of support for systems other than Sling, and for arbitrary
customisation of how DistributionRequests are serialised, the time and
resources needed to export a "package", in whatever format, are undefined. So
the bigger the DistributionRequests are, the more likely it becomes that
package creation itself fails, which, depending on the trigger used, might
cause the loss of the entire request.

For example, using SCD to index binary documents in Solr might require parsing
them with Tika and sending only the plain text as the result. Depending on the
documents to distribute, it might make sense to split the DistributionRequest
so that package creation time stays close to an approximate mean.
