On Tue, 15 Nov 2016 15:22:09 -0600 Derek Foreman <der...@osg.samsung.com> said:
> On 12/11/16 09:31 PM, Carsten Haitzler (The Rasterman) wrote: > > hmmmm. i'm reading this code and it seems broken. > > > > let's start with the simple one. > > > > 1) > > ecore_evas_wayland_common.c: this shouldnt MESS with manual render at all. > > at most it should just have > > > > if (wdata->reset_pending) return 0; > > > > added to the top "not going to do this" section in > > _ecore_evas_wl_common_render () and remove all the other manual render > > things. ecore evas already forces all rendring to be driven by an animator > > out of the box. this animator should either be totally driven by frame > > request events from wayland (compositor) itself OR at least hook it up to > > the drm device... but either way this code is wrong as it messes ith the > > manual render flag that is ALSO altered via api from ecore_evas "public > > api" calls which are called by elm win etc. > > > > so this has to go... it's wrong at a minimum and likely causing issues. > > It's known broken and definitely needs to be fixed, as discussed in > other email. I'll give it a higher priority. as long as you know. but as there was a small cluster-bomb of bugs i had to disable this in cae it was part of that pipeline. it was just a local change for testing not even git committed locally. > > 2) > > ecore_evas_animator_tick() has a region but NO TIMESTAMP. please look at > > ecore_x_vsync.c and specifically ecore_loop_time_set() that uses the vblank > > timestamp to get EXACT vblank timing into. this is a must for having a > > STEADY animator with STEADY timepoints. a region - fine... but you need a > > timestamp too. > > Fixed in git. good. :) > > now to the nasty one... > > > > 3) > > ecore evas drm engine using pageflips for animation. errrr... WHA? no. well > > ok. tell me if i'm wrong. to be able to tick the animator the code DOES a > > page flip... THEN uses the page flip event to drive the animator? what? > > this is just wrong on so many levels. animators are not a RESULT of > > altering the fb from some previous page flip. they are a tick that says > > "the screen scanout is refreshing NOW and you should time your animations > > NOW to THIS point". i suspect this code is a huge source of brokenness. why > > are we scheduling a flip ... we don't even HAVE a new frame yet? > > Usually we have a frame before we schedule because the first frame is > frequently drawn before evas starts to "overtick"... > > drmModePageFlip(same buffer currently on screen) > is similar in behaviour to > drmWaitVBlank() > > in that it completes at the same with the same timestamp. However once > you start the page flip you can't do another one until it completes. so... are you telling me... if we do a "null" flip (buffer flip to itself - ie current front buffer) we are unable to then before our scanout time is done to queue a flip to a new frame? we're stuck flipping to nothing? > What would happen if we didn't have a new frame before calling > drmWaitVBlank()? We wouldn't do anything until the following tick. > These two methods should be equivalent at the start of animation? yup. BUT after this frame the vblank one will tick every single frame irrespective of us having a buffer to display or not. > i know i know... it's "simpler" > > and you can get a flip event per output... BUT firstly the timestamp is not > > even USED from the page flip event in the code i'm reading > > Fixed in git by cedric. > > , secondly i have no > > idea WHAT timestamp this would be (i know the vblank events are the vblank > > interrupt timestamp). the pageflip COULD be some other time during scanout > > (end of vblank, end of scanout? something else?). > > No, it isn't. it's the same timestamp as vblank. Exactly the same > timestamp you'd have received from drmWaitVBlank well i was asking what it was. :) what could it be? :) > but either way it's not used and > > should absolutely be. but ... in order to GET an event you have to do a > > flip... and i'm reading ecore_drm2_fb_flip() and that is seemingly what is > > being done to just get this event. it seems to be able to do a "null > > flip" (flip the current buffer to itself), but i'm not so sure this is > > reliable given the api and code - is output-next always null when we > > register an animator with _drm_animator_register() ? > > output->next will likely go away. It only exists for when render > updates come in faster than we draw them - something that should not occur. well shouldn't occur unless we are benchmarking. then it should work still. > > what happens when we do a null page flip THEN do a real one? > > You can't do a page flip while a page flip is running - so trying to > flip again will set that frame as output->next. so you can alter what to flip any time... but when flip time comes it'll flip to whatever was last flipped to? > I haven't checked recently if we're still seeing any cases where a flip > is requested during a flip. If there are no more bugs causing that to > occur then output->next can go. we will have to handle this. benchmarking ... AND the 60 -> 30hz issue. > It's currently there to ensure that if we get an "extra" frame (ie: one > that asynchronously arrived and wasn't from a tick), and no following > frame is provided, then the frame isn't lost. ie: last frame before the > animator stops triggered asynchronously, and if we didn't make sure it > was presented then the user could be staring at a stale frame indefinitely. > > > also here comes the stinker. what happens when we MISS a whole frame because > > something went slow? we're scheduling from the mainloop. since we are asking > > for the NEXT animator tick from a page flip... we only schedule a new page > > flip once we already have figured something changed in the ecore_evas or the > > animator is registered etc... imagine for a second we have a 16ms frame gap > > (approx 60hz). what if we spend 18ms this frame rendering or doing > > something? we MISS the next opportunity to page flip, so we actually have > > to wait another 14ms for the next frame. thus we do either 60hz, or 30hz... > > but can never play catch up and squeeze in another frame because we missed > > the opportunity to schedule one. this scheme using page flips is broken "by > > design" in this way. > > I can't follow you here - are you asking for tearing or are you asking > for incorrect timing? no. i'm saying we should sit and then sleep dropping to 30hz. imagine something consistently takes 18ms to render. your framerate will BE 30hz. you can never run at 55hz which is what you should be able to. yes - on a 60hz refresh this does stutter but its what people WANT. there is a bug report from the guys working on blink + efl where we had EXACTLY this issue because the vblank based ticker for x was scheduling inline exactly like drm is now with page flips. it's a bug report from a heavy user who NEED and WANT to get all the frames they can even if they are uneven. i agree with them. it's a bug. > Assuming page flips for presentation (ie: no tearing, ever), if the > frame started at vblank N is not ready for scanout at N+1 then it can > not be display until N+2 without tearing. There will be an unavoidable > stutter in framerate, as we have nothing new for N+1. correct. > If I understand you correctly, you then want to use N+1's time as a tick > then we're going to deliver that frame at N+3, so we've just ticked with > an incorrect time, but we gave it a lot of slack time to render with. correct. > If frame times are consistently over the line then we'll have a juddered > output which is likely more annoying to the average user than a > consistent run of 30fps, and we'll have added 16ms of input latency to > most, but not all, frames. if you are "just over the line" you will add 16ms latency ANYWAY no matter what every single frame. so this is a fact of life. if we could PREDICT how long the frame would take to render and knew when it would be presented then i'd switch to using the predicted time instead as the timestamp for the animator (current loop time). but we cant sensibly do this atm so we don't. as above. i know of bug reports because of this exact behavior and my solution was "tick from a thread so you never miss a frame tick. it's up to the mainloop to try and render as fast as it can to keep up or sample from that even if the sampling is not evenly spaced". at least then its an easy matter of policy changes as we have all the data and can choose to use a subset. but if frame ticks are scheduled from main loop we literally LOSE data and no longer have it (lose ticks) and are unable to do anything about it. > If we only see intermittent spikes (say, render takes 10ms, 21ms, 10ms, > 10ms) then we've merely added a timing error. if you miss a frame you will have a timing error no matter what. > We only get to "catch up" if we have something like 10ms, 19ms, 19ms, > 10ms, as frame 3 is rendering (with a wrong time) during the missed > opportunity to present frame 2, but it gets enough time to present. correct. but the current scheme doesnt allow catch-up at all. you cant do 45hz or 55hz. see above bug report. game developers and players hate it. they want every frame they can get if its evenly spaced or not. > If we get 10ms, 18ms, 4ms, 10ms render times, then frame 3 is ready to > display before frame 2 actually hits scanout. That was cpu time/battery > well spent. > > This isn't really hiding the inability of the render engine to present > frames fast enough, it's just making it manifest in a different way - > and with more complicated code. it means your benchmarks come up at either 30hz, 20hz, 15hz or 60hz. they never can do 55hz or 45hz or 58hz. > If I've missed your point and you're actually asking for frame > presentation at vblank N+1.5 that's obviously going to come with a > jarring visual artifact and likely be impossible to do with atomic plane > setting. you got the point. there are bug reports because of exactly this and i fixed it a while back with a dedicated thread. this will need fixing too here otherwise the same bug report will happen here as well. > > please have a look at ecore_x_vsync.c. yes it's complex. it does > > whitelisting of working drivers and handles some broken drm info stuff. it > > dlopens libdrm to not have to link or compile against it at all. it also > > handles timeouts on drm vblank events and after enough will switch to time > > based animation. but ignore that... > > I have looked at that file in the past, it's very hard to read. It > would be helpful if there were comments indicating why it does things > like check if a timestamp is in the future. dealing with the case where monotonic clocks dont work (monotonic will be less than real data - at least on todays machines where 0 == jan 1 1970), so if this is using the gettimeofday as a fallback - deal with it and make the timestamp work in gettimeofday timestamps. > > it uses a dedicated thread JUST for scheduling vblank events and reading > > them so it will never miss one. in the dedicated thread ALL it does is > > schedule a vblank and sleep waiting for it to happen (or a command from > > mainloop to stop doing events, timeout in case of screen going off to fake > > animation ticks while vblanks are mising or exit etc.). so it can never (on > > even a heavily loaded system and irrespective of what the mainloop does) > > miss a frame... unless it > > I'm not sure I get your definition of "miss a frame" - you have an 80ms > render stall on RPI3, is that not several missed frames? that'd be "miss a frame" - miss several actually. > > literally cannot manage to wakeup from the fd coming alive. call > > drmHandlEvent, write the timestamp down an fd, check cancellation flag, the > > come back to scheduling it again within 16ms. you'd have a fundamentally > > broken system if this was not possible as this thread does nothing else. > > there is a good reason it is done this way. > > > > i have found that we HAVE managed to lose frames and drop from 60 to 30hz > > and the effects are not pretty. the drm engine really has to do the same. > > and i'm not so sure you can use page flip events then as you have a > > completely independent thread trying to use the drm fd to get these events > > and you don't want IT flipping pages to the same or a new one... so you > > likely have to use a separate drm fd like ecore_x_vsync.c does and rely on > > vblank events which are passive. > > This is a huge amount of complexity to implement a very odd timing > mechanism. :( it's necessary to not miss frames when scheduling animators. i literally watched this phenomenon happen on my screen for a while and was wondering what was up - it was this "something makes us miss frames often enough to instead of go at 50 or 55hz that i'd expect to drop all the way down to 40hz consistently for a while". it was a horrible effect. the thread cured it. > > chances are this code should become common/shared. but what i am reading > > right now smells of broken... and on my rpi i see broken... in fact the > > mouse just doesn't ever (barely) render/move unless i hold it over an ibar > > icon that is pulsing and animating... then it works, but move my mouse > > away... it basically > > That bug existed from Thursday last week to Monday this week. i got unlucky. :) but it was bad. :) > > doesn't work. i suspect this pageflip "to myself" trick is to blame. yes the > > vblank code only keys off a single screen and will ignore all other vblank > > events. i guess this would have to be figured out somehow. > > > > On my RPI3 weston can't update the cursor plane at > 30fps when gl > rendering, yet somehow manages that ok when run with --use-pixman > (software render, no GL). That's just moving the cursor plane around - > in neither case are any actual pixels being rendered! > > --use-pixman is also massively faster when resizing a window. > (somehow weston-simple-egl manages 60fps if it's flipped directly to > scanout though...) > > gl rendering sometimes draws flickers of crap in the cursor plane if any > pixels are actually rendered. > > I'm using raspbian, and haven't done anything special except enable the > vc4 mesa driver. > > I really don't think there's something we're doing to cause these > performance drops you're seeing. There's just something wrong with the > rpi3/raspbian. You're going to see broken on rpi3 because rpi3 is > broken. It's as simple as that. well actually not quite true. the "timers and drm were working at the same time" to cause buffer age switches frame by frame was something i noticed on the pi due to its slowness. this wanst a pi break. the "1fps" issue over the weekend wasn't a pi/raspbian brokenness. the current "you get 60 or 30hz" isnt a pi brokenness. it helps bring it out. the 80ms/frame thing seems to be something on the pi going really slow. i dont know what it is yet but its something gl related i'm pretty sure. i'd like to know if it is our use of client side buffers in drawarrays and the cpu stalling (and maybe using vbos with a queue of them so we can avoid mapping in-use vbo's and pick an old vbo from a pool) would help mitigate such stalls? is it because we do a texture upload mid-render and a stall? the pi is slow enough to really bring out badness easily and that is why i like it. these stalls will exist elsewhere and may simply be lower impact. the pi is just a great way to see the issues in big blinking neon lights. > How about we wait until the platform actually works before trying to > optimize for it? it indeed may have issues. but so do we. we have people who've complained that e uses too much cpu when compositing. we have people asking us to reduce our overhead when scrolling lists around. i'm looking into what we can do to reduce that overhead on low end systems. it matters. -- ------------- Codito, ergo sum - "I code, therefore I am" -------------- The Rasterman (Carsten Haitzler) ras...@rasterman.com ------------------------------------------------------------------------------ _______________________________________________ enlightenment-devel mailing list enlightenment-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/enlightenment-devel