On 11/06/2014 13:59, Roy Spliet wrote:
Dear Mr. Dew,
I hereby wish to propose the X.org EVoC project "REclock -
Reverse-engineer and implement NVA3/5/8 Voltage- and Frequency Scaling
in Nouveau" for which I am willing to participate, and apply for the
associated funding. Full details below or on
http://nouveau.spliet.org/evoc.html . For any further questions feel
free to contact me either on Freenode IRC (rspliet) or by e-mail to
this address.
Thank you for your consideration, and I look forward to hearing more
from you soon. Yours,
Roy Spliet
Hello Roy,
Thank you for your proposal. After careful consideration by the board
of directors, we have accepted it. You may start your EVoC on June 16th.
Our treasurer will contact you in private to get your banking
information and send you an initial payment along with 250€ for buying
the hardware you need.
We wish you and your mentor the best of luck on this project! Do not
hesitate to contact us if you have any questions.
Martin Peres, on behalf of the board of directors
---
REclock: Reverse-engineer and implement NVA3/5/8 Voltage- and
Frequency Scaling in Nouveau
NVIDIA graphics cards often support running at a variety of different
performance "levels". This aids in reducing the power demand and heat
dissipation of the devices when idle, while unleashing full potential
under load. A performance level comprises the clock speed and voltage
for several subcomponents in the GPU. The difference between the
lowest and highest performance level can be as much as a factor of 10 in
clock speed.
Despite hard work from many developers, reclocking support in Nouveau
still has quite a few loose ends: engine reclocking is mostly in place
but not always reliable, there are several missing routines related to
memory reclocking, and in general the actions required to perform
voltage and frequency scaling are not, or only partially, understood.
Because of this, NVIDIA GPUs driven by nouveau are limited to using
the boot speed and voltage only, severely limiting performance and
usability.
For this project, I aim to tie these loose ends together for NVIDIA's
NVA3/5/8 GPUs. I intend to fully reverse engineer several
subcomponents related to voltage and frequency scaling, try to get a
full understanding of the clock tree and use this gained knowledge to
further improve the nouveau voltage and frequency scaling
implementation for said GPUs.
Personal information
My name is Roy Spliet. I hold a master's degree from Delft University
of Technology (TU Delft) and plan to continue my academic career as a
PhD student in computer architecture. My background
includes kernel/driver development (nouveau, LITMUS^RT) and GPGPU
programming in OpenCL.
Previous involvement in Nouveau has led to successfully
reverse-engineering and implementing reclocking support for the
memory-less NVIDIA NVAA and NVAC chipsets, alongside many
contributions to memory reclocking for pre-NVC0 (Fermi) GPUs. For more
details about my personal background, please consult
http://roy.spliet.org.
Background
NVIDIA GPUs feature a complex multi-layer clock tree that allows for
per-subcomponent alteration of clock speeds. The precise clock tree is
a complex network consisting of one or more input clocks, several
fixed dividers, and a lot of routing to distribute these clocks to
every subcomponent. On the last level there is usually a Phase-Locked
Loop (PLL) that can take either the original clock or one of several
divided clocks as an input, and bring this clock up to the desired
level for the associated subcomponent. Control registers alter the
precise input of these PLLs, and can in addition be configured to
bypass the PLLs.
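
As a rough illustration of the last stage described above, the sketch
below models a single clock path: a fixed divider feeding a PLL that can
be bypassed. The relation f_out = f_in * N / (M * P) is the generic
textbook PLL formula. The class name, the coefficient names (n, m, p) and
the 27 MHz input are assumptions chosen for illustration; they do not
describe the actual NVA3/5/8 register layout.

```python
from dataclasses import dataclass


@dataclass
class ClockPath:
    """One leaf of a simplified clock tree (hypothetical model)."""
    input_khz: int        # root input clock, e.g. a crystal (assumed value)
    divider: int          # fixed divider in front of the PLL
    n: int                # PLL feedback multiplier
    m: int                # PLL input divider
    p: int                # PLL post divider
    bypass: bool = False  # True: route the divided clock around the PLL

    def output_khz(self) -> int:
        """Frequency seen by the subcomponent at the end of this path."""
        divided = self.input_khz // self.divider
        if self.bypass:
            return divided
        return divided * self.n // (self.m * self.p)


path = ClockPath(input_khz=27000, divider=1, n=100, m=3, p=2)
print(path.output_khz())  # PLL engaged
path.bypass = True
print(path.output_khz())  # PLL bypassed: the divided input clock
```

Integer kHz values are used so that the arithmetic stays exact; a real
driver would additionally have to respect the coefficient ranges the
VBIOS gives for each PLL.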
The video BIOS (VBIOS) provides two services: it takes care of
bringing the GPU into an initial valid state, and it contains crucial
information regarding reclocking. Most importantly, the VBIOS
describes the ranges of each PLL in the system. On a higher level, the
VBIOS also contains several "performance levels". Each level consists
of a clock speed for each subcomponent. NVIDIA's driver switches
between these performance levels based on the load. For most engines
this routine consists of bypassing the PLL, setting it to a new value,
testing the newly set values, and then re-enabling the PLL.
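
The four-step routine described above can be sketched as follows. This is
a minimal model, not driver code: the class names, the fields of
PllLimits and the example frequencies are all hypothetical, and the
"test" step is only a stand-in for whatever check the hardware actually
requires.

```python
from dataclasses import dataclass, field


@dataclass
class PllLimits:
    """Allowed output range for one PLL (hypothetical VBIOS-derived fields)."""
    min_khz: int
    max_khz: int


@dataclass
class EnginePll:
    limits: PllLimits
    freq_khz: int
    bypassed: bool = False
    log: list = field(default_factory=list)

    def reclock(self, target_khz: int) -> None:
        # Reject targets outside the range given for this PLL.
        if not self.limits.min_khz <= target_khz <= self.limits.max_khz:
            raise ValueError(f"{target_khz} kHz is outside the PLL range")
        self.bypassed = True             # 1. bypass the PLL
        self.log.append("bypass")
        self.freq_khz = target_khz       # 2. program the new value
        self.log.append(f"set {target_khz}")
        if self.freq_khz != target_khz:  # 3. test it (stand-in for a hardware check)
            raise RuntimeError("PLL did not take the new value")
        self.log.append("test")
        self.bypassed = False            # 4. re-enable the PLL
        self.log.append("enable")


pll = EnginePll(PllLimits(min_khz=100_000, max_khz=700_000), freq_khz=405_000)
pll.reclock(650_000)
print(pll.log)
```

Keeping the range check ahead of the bypass means an invalid request
leaves the engine untouched on its current clock.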
Memory reclocking
Memory reclocking is a bit more difficult than reclocking other
engines. Besides an input clock, the memory controller also needs to
know a variety of latencies, which are usually defined in clock ticks
but mandated in nanoseconds. These latencies, or timings, are described
in the VBIOS.
To keep the memory controller and the engines running in sync, a form
of link training is also required. Updating all this information must
be done according to strict timing requirements, and failure to meet
these deadlines results in corrupted memory and all the consequences
that follow. Although the memory itself is often well documented
publicly, NVIDIA's memory controller is not. Reverse engineering it is a
difficult challenge, as there is very little feedback beyond either a
working system or a complete crash.
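
To make the point about latencies concrete: a timing mandated in
nanoseconds has to be converted into a whole number of clock ticks, and
that number changes with every memory clock. The sketch below shows the
conversion with rounding up, so that the programmed delay never
undercuts the requirement. The function name and the 15 ns example value
are assumptions for illustration, not taken from any datasheet or VBIOS.

```python
def ns_to_ticks(latency_ns: int, clock_khz: int) -> int:
    """Smallest whole number of clock ticks that covers latency_ns.

    ticks = ceil(latency_ns * clock_khz / 1_000_000), computed with
    integers only to avoid floating-point rounding surprises.
    """
    return -(-(latency_ns * clock_khz) // 1_000_000)


# The same hypothetical 15 ns requirement at three memory clocks.
for clock_khz in (324_000, 400_000, 800_000):
    print(clock_khz, ns_to_ticks(15, clock_khz))
```

The tick count grows with the clock, which is why the timing values must
be recomputed and rewritten on every memory reclock.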
Reclocking engine
To facilitate the action of reclocking from within the GPU itself,
increasing stability on operating system failures, NVIDIA added a
subcomponent called PDAEMON. This component has full access to many
registers accessible through MMIO, including the registers controlling
the clocks, latencies and other power-management related features.
PDAEMON is a programmable engine supporting the Falcon or fμc ISA.
NVIDIA's driver uploads the firmware for this engine, dubbed PMU.
PMU is responsible for many power-management related functions,
including monitoring temperature, controlling fan speed and monitoring
the load on the GPU. To alter clock speeds, the NVIDIA driver can upload
special scripts in a language called "seq" that will be interpreted by
PMU. These scripts contain sequences of registers that need to be
adjusted in order, along with required pause commands and other logic.
Full understanding of the seq ISA gives full understanding of the
actions executed by NVIDIA's driver on a reclock operation and their
timing.
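
The general idea of such a script (ordered register writes interleaved
with pauses) can be sketched with a toy interpreter. Everything below is
invented for illustration: the opcode names, the tuple encoding and the
register addresses are hypothetical and are not the real seq ISA, which
is exactly what this project sets out to document.

```python
import time


def run_script(script, regs, sleep=time.sleep):
    """Execute a list of (opcode, *args) tuples against a register dict."""
    for op, *args in script:
        if op == "wr":          # write a value to a register
            addr, value = args
            regs[addr] = value
        elif op == "mask":      # read-modify-write of selected bits
            addr, mask, value = args
            regs[addr] = (regs.get(addr, 0) & ~mask) | (value & mask)
        elif op == "wait_ns":   # pause before the next step
            (delay_ns,) = args
            sleep(delay_ns / 1e9)
        else:
            raise ValueError(f"unknown opcode {op!r}")
    return regs


regs = {0x1000: 0x000000FF}     # made-up register address and contents
script = [
    ("mask", 0x1000, 0x0000000F, 0x00000003),
    ("wait_ns", 2000),
    ("wr", 0x1004, 0x80000000),
]
run_script(script, regs, sleep=lambda seconds: None)  # skip real delays here
print({hex(addr): hex(value) for addr, value in regs.items()})
```

A decoder for the real scripts would take the same shape in reverse:
walk the byte stream, recognise each opcode and print the register
operation it stands for.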
Nouveau has its own implementation of the PMU microcode, including a
scriptable engine offering many of the capabilities implemented in
older hardware. However, its capabilities might be insufficient to
perform all the tasks that NVIDIA's driver performs through PMU.
Current state
Nouveau has a lot of code in place for engine reclocking. Many of the
PLLs have been identified, and some of the control registers have been
reverse engineered either partially or completely. Although known to
work on some GPUs, engine reclocking does not work reliably, at least
on my NVA8.
For memory reclocking, some code exists to determine the latencies
that the memory and the memory controller need to know. Still, there
are some other features vital for memory reclocking that are
ill-understood, unimplemented and/or incorrect. In addition, the order
of events is likely wrong. As a result, clocking memory to any
performance level higher than the boot clocks likely results in memory
corruption. The link training unit found on some GPUs with DDR3 is one
important example of a feature not handled by Nouveau currently.
Large parts of the VBIOS are well understood and parsed both by the
nouveau kernel driver and the envytools VBIOS parsing tool. Any bits
that remain unknown could yield interesting clues about the actions
required for reclocking.
Project
Scope
In this project I aim to get a better understanding of the reclocking
features of the NVA3/5/8, as utilised by NVIDIA's official device
driver. The eventual goal of this project is complete voltage and
frequency scaling for these GPUs in nouveau. Gained knowledge could
benefit the implementation of newer generations of cards as well.
I limit myself to the core features and aim for a manual control of
the voltage and clock frequencies based on profiles in the VBIOS;
dynamic reclocking based on load information is beyond the scope of
this project.
Initial code contributions will not make use of Nouveau's PMU engine.
If it turns out that the PMU is absolutely necessary, the firmware could
be extended to support the desired functionality. However, until this
is established, reclocking through PDAEMON is considered a nice-to-have
feature with low priority.
Benefits to the community
Users will benefit from the increased performance that nouveau can
offer under higher clocks, while having the capability to save energy
when the processing power is not required. This could lead to
prolonged battery life for mobile systems using the Open Source
NVIDIA driver stack.
This work combined with the GSoC project on performance counters
provides the prerequisites for implementing dynamic frequency scaling
in future work, enabling all users of the open source graphics driver
stack to profit from these benefits without manual intervention.
Deliverables
Implementation will be done entirely in the Nouveau kernel module,
forked from an upstream kernel. Produced patches are intended to be
merged back into mainline kernel at the end of the project, but might
require some after-care when conflicting maintenance is done on
nouveau. Controls are exposed through sysfs.
Documentation will be added to the "envytools" Git repository where
applicable.
Mentor
Ilia Mirkin
Schedule
My availability is roughly full time between now and the start of the
new academic year in October. Tentative planning:
Description | Deliverable | Timeframe | Required
Reverse engineer seq ISA | Documentation (envytools) | 1 week | X
Write seq script decoder | Decoding tool (envytools) | 1 week | X
RE clock tree for NVA3/5/8 | Documentation (envytools), full graph | 1-2 week(s) | X
Finish/fix engine reclocking for NVA3/5/8 | Kernel code allowing users to successfully select any performance level through SysFS | 1 week | X
RE+implement DDR3 link training unit | Documentation (envytools) + Kernel code (no directly visible changes) | 1 week | X
RE+implement DDR3 memory reclocking | Kernel code, observable performance improvements for highest performance level on affected GPUs | 3 weeks | X
RE+implement GDDR3 memory reclocking | Kernel code, observable performance improvements for highest performance level on affected GPUs | 3 weeks | X
RE+implement GDDR5 memory reclocking* | Kernel code, observable performance improvements for highest performance level on affected GPUs | ? |
RE+implement DDR2 memory reclocking* | Kernel code, observable performance improvements for highest performance level on affected GPUs | ? |
* If hardware available
Risks
There is little risk attached to the tasks that result in documentation
of the clock tree. Patches to the nouveau kernel tree are expected,
but there is a chance that the code does not generalise to all cards.
Earlier experience makes me confident engine reclocking can be
implemented with low risk. Achievements for memory reclocking are not
guaranteed given the complexity of the job, although progress is
definitely expected.
Hardware
I currently possess one NVA8 GPU with DDR3 memory. More NVA3/5/8
hardware is available through Martin Peres and accessible remotely.
Possibly missing in our combined collection are NVA3/5/8 graphics
cards with DDR2. If budget is available, one could be purchased
(approximately €50 new) by either Martin Peres or myself for
reverse-engineering purposes.