Hi Tobias,

On 01/18/2012 10:55 PM, Tobias Wunden wrote:
Hi David,

On our recently upgraded system (1.3.x) I'm seeing repeated failures to distribute; the 
workflow just stalls on the "Distributing media to progressive downloads" step.

Where is Matterhorn trying to copy the files to? Is it a network share or a 
local disk? Does it start copying the files, and if so, do you see the file 
size increase?


The shared volume is a single NFS volume mounted on the admin node, so hard linking is enabled.



I've so far tried:
1) Not having the workers talk to the admin via mod_proxy
2) Doubled the memory on the admin.
I am wondering why you think it could be a memory issue. The 
distribution service first copies the files to the local workspace and then 
copies them to the final destination using either file channels (which use the 
native OS for copying) or chunks of 1 MB.
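
For illustration, the two copy strategies described above could look roughly like the following. This is a minimal sketch, not the distribution service's actual code: the class name CopySketch, the method names, and the CHUNK_SIZE constant are hypothetical. Neither variant holds more than one buffer of file data on the heap at a time.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, for illustration only (not Matterhorn code).
final class CopySketch {

  // Assumed chunk size, matching the "chunks of 1 MB" mentioned above.
  static final int CHUNK_SIZE = 1024 * 1024;

  // Copy via file channels; the JVM may delegate the transfer to the OS.
  static long copyWithChannel(Path source, Path target) throws IOException {
    try (FileChannel in = FileChannel.open(source, StandardOpenOption.READ);
         FileChannel out = FileChannel.open(target, StandardOpenOption.CREATE,
             StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
      long size = in.size();
      long position = 0;
      // transferTo may move fewer bytes than requested, so loop until done.
      while (position < size) {
        long transferred = in.transferTo(position, size - position, out);
        if (transferred <= 0) {
          break; // no progress; stop rather than spin forever
        }
        position += transferred;
      }
      return position;
    }
  }

  // Copy through a fixed 1 MB buffer, so heap use does not grow with file size.
  static long copyInChunks(Path source, Path target) throws IOException {
    byte[] buffer = new byte[CHUNK_SIZE];
    long total = 0;
    try (InputStream in = Files.newInputStream(source);
         OutputStream out = Files.newOutputStream(target)) {
      int read;
      while ((read = in.read(buffer)) != -1) {
        out.write(buffer, 0, read);
        total += read;
      }
    }
    return total;
  }
}
```

Both methods return the number of bytes copied, which can be compared against the source file size to check that a copy completed.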


I had seen an out of memory error on the admin node, and have since been able to reproduce it. The first run with YourKit also seems to indicate that something is not right with memory usage. I will post more when I have an analysis.



Before I did 1. I did see messages that "distributing file x timed out. retrying in 
y ms" - but never saw any evidence of a retry. The last job after I made change 1. 
just failed after putting all 3 nodes in the cluster under heavy load.
That is *really* strange. I tried to find that log message (or similar ones) in 
the codebase without any luck. Do you think you could possibly get the correct 
message from the logs?


2012-01-18 21:01:44 INFO (TrustedHttpClientImpl:338) - Sleeping 557459ms before trying request http://media.uct.ac.za:8080/distribution/streaming/dispatch again due to a HTTP/1.1 401 Nonce has expired/timed out

This is from the admin node.
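
A sleep of 557459 ms is a little over nine minutes, which from the outside is hard to tell apart from a stalled workflow. As a rough sketch of the retry-on-401 pattern that the log line implies, with the delay capped, something like the following could be used. This is not the TrustedHttpClientImpl code; the class NonceRetrySketch, its method, and all parameters are hypothetical, and the request is modelled as a plain callback returning an HTTP status code.

```java
import java.util.function.IntSupplier;
import java.util.function.LongConsumer;

// Hypothetical sketch, for illustration only (not Matterhorn code).
final class NonceRetrySketch {

  /**
   * Runs the request and retries while it answers 401, doubling the delay
   * between attempts but never sleeping longer than maxDelayMs.
   *
   * @param request    callback that performs the request and returns the status code
   * @param maxRetries maximum number of retries after the first attempt
   * @param maxDelayMs upper bound for a single sleep, in milliseconds
   * @param sleeper    callback that waits for the given number of milliseconds
   * @return the status code of the last attempt
   */
  static int executeWithRetry(IntSupplier request, int maxRetries,
                              long maxDelayMs, LongConsumer sleeper) {
    int status = request.getAsInt();
    long delayMs = 1000; // assumed initial delay of one second
    for (int attempt = 0; attempt < maxRetries && status == 401; attempt++) {
      sleeper.accept(Math.min(delayMs, maxDelayMs));
      delayMs *= 2;
      status = request.getAsInt();
    }
    return status;
  }
}
```

Passing the sleep as a callback keeps the sketch easy to exercise without actually waiting; in real use it would wrap Thread.sleep.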


The media packages in question are in the region of 1.2 GB and contain just over 
an hour of recordings. Smaller media packages have no issues, which leads me to 
suspect that there is a regression in this service that is doing something like 
reading the files into memory.
A timeout seems likely. It is possible that we are hitting a bug 
with downloading large files from the working file repository. Do you have the 
working file repository root configured to the same share as the workspace root 
(i.e., is hard linking enabled)?


Yes

Thanks,
Tobias
_______________________________________________
Matterhorn mailing list
[email protected]
http://lists.opencastproject.org/mailman/listinfo/matterhorn


To unsubscribe please email
[email protected]
_______________________________________________